Technical  Report 
717



ESD-TR-85-313 


Speech  Transformations  Based  on 
a  Sinusoidal  Representation 


T.F.  Quatieri 
R.J.  McAulay 


16  May  1986 


Lincoln  Laboratory 

MASSACHUSETTS  INSTITUTE  OF  TECHNOLOGY 

Lexington, Massachusetts


Prepared for the Department of the Air Force
under  Electronic  Systems  Division  Contract  F19628-85-C-0002. 


Approved  for  public  release;  distribution  unlimited. 




The  work  reported  in  this  document  was  performed  at  Lincoln  Laboratory,  a 
center  for  research  operated  by  Massachusetts  Institute  of  Technology,  with  the 
support  of  the  Department  of  the  Air  Force  under  Contract  F19628-85-C-0002. 

This  report  may  be  reproduced  to  satisfy  needs  of  U.S.  Government  agencies. 


The  views  and  conclusions  contained  in  this  document  are  those  of  the  contractor 
and  should  not  be  interpreted  as  necessarily  representing  the  official  policies, 
either  expressed  or  implied,  of  the  United  States  Government. 


The  ESD  Public  Affairs  Office  has  reviewed  this  report,  and 
it  is  releasable  to  the  National  Technical  Information 
Service,  where  it  will  be  available  to  the  general  public, 
including  foreign  nationals. 


This  technical  report  has  been  reviewed  and  is  approved  for  publication. 


FOR  THE  COMMANDER 


Thomas  J.  Alpert,  Major,  USAF 

Chief,  ESD  Lincoln  Laboratory  Project  Office 


Non-Lincoln  Recipients 

PLEASE DO NOT RETURN

Permission  is  given  to  destroy  this  document 
when  it  is  no  longer  needed. 


MASSACHUSETTS  INSTITUTE  OF  TECHNOLOGY 
LINCOLN  LABORATORY 


SPEECH  TRANSFORMATIONS  BASED 
ON  A  SINUSOIDAL  REPRESENTATION 


T.F. QUATIERI
R.J. McAULAY

Group  24 


TECHNICAL  REPORT  717 

16  MAY  1986 


Approved  for  public  release;  distribution  unlimited. 


LEXINGTON 


MASSACHUSETTS 


ABSTRACT 


In this report, a new speech analysis/synthesis technique is presented which provides the
basis for a general class of speech transformations including time-scale modification,
frequency scaling, and pitch modification. These modifications can be performed with a
time-varying change, permitting continuous adjustment of a speaker’s fundamental
frequency and rate of articulation. The method is based on a sinusoidal representation of
the speech production mechanism that has been shown to produce synthetic speech that
preserves the waveform shape and is essentially perceptually indistinguishable from the
original. Although the analysis/synthesis system originally was designed for single-speaker
signals, it is equally capable of recovering and modifying nonspeech signals such as music,
multiple speakers, marine biologic sounds, and speakers in the presence of interferences
such as noise and musical backgrounds.


TABLE OF CONTENTS


Abstract

List of Illustrations

1. INTRODUCTION

2. ANALYSIS/SYNTHESIS BASED ON A SINE-WAVE SPEECH MODEL

3. TIME-SCALE MODIFICATION

4. FREQUENCY TRANSFORMATIONS

5. JOINT TIME-FREQUENCY MODIFICATIONS

6. DISCUSSION

References


LIST OF ILLUSTRATIONS


Figure
No.

2.1  Sinusoidal Representation of Speech Production

2.2  Block Diagram of Sinusoidal Analysis

2.3  Requirement of Unwrapped System Phase
     (a) Interpolation of Wrapped System Phase
     (b) Interpolation of Unwrapped System Phase

2.4  Block Diagram of Sinusoidal Synthesis

2.5  Reconstruction of Signal from Single Speaker
     (a) Original
     (b) Reconstruction

2.6  Reconstruction of Signal from Two Speakers
     (a) Original
     (b) Reconstruction

2.7  Reconstruction of Speech in a Music Background
     (a) Original
     (b) Reconstruction

3.1  Time Warping with Fixed Rate Change p > 1

3.2  Model of Time-Scale Modification

3.3  Time-Scale Expansion of a Single Sine Wave

3.4  Functional Mappings for Time-Scale Expansion

3.5  Block Diagram of Uniform Rate-Change System

3.6  Flow Diagram of Computer Implementation of Uniform Rate-Change System

3.7  Time-Scale Modification of Synthetic Waveform
     (a) Original
     (b) Reconstruction (p = 1.0)
     (c) Expansion (p = 1.5)
     (d) Compression (p = 0.5)

3.8  Time-Scale Expansion of Speech
     (a) Original
     (b) Expansion (p = 2)

3.9  Time-Scale Compression of Speech
     (a) Original
     (b) Compression (p = 0.5)

3.10 Time-Scale Expansion of Speech in Music
     (a) Original
     (b) Expansion (p = 1.5)

3.11 Time-Scale Expansion of Combined Male and Female Speech
     (a) Original
     (b) Expansion (p = 2)

3.12 Piecewise Constant Rate Change
     (a) Rate Change Function
     (b) Time-Warp

3.13 System Phase Mapping for the Piecewise Constant Rate Change of Figure 3.12

3.14 Flow Diagram of Computer Implementation of Nonuniform Rate-Change System

3.15 Speech Segments from the Passage, “She Fell From the Car,” with
     Superimposed Spectral Derivative. The Spectral Derivative Has Been
     Normalized to Lie Between Zero and Unity and Was Assumed
     Constant over an Analysis Frame

4.1  Time-Frequency Illustration of Frequency Compression
     (a) Original
     (b) Frequency Compressed

4.2  Frequency Compression of the Spectral Magnitude
     (a) Original
     (b) Compression

4.3  Sinusoidal Model for Pitch Modification

4.4  Time-Frequency Illustration of Pitch Modification
     (a) Original
     (b) Pitch-Scaled (Lowered)

4.5  Pitch Modification of Synthetic Waveform
     (a) Original
     (b) Increase in Pitch (β = 1.5)
     (c) Decrease in Pitch (β = 0.5)

4.6  Pitch Modification of Speech in the Frequency Domain
     (a) Original
     (b) Pitch-Scaled Spectral Magnitude (β = 1.5)

4.7  Pitch Modification of Speech in the Time Domain
     (a) Original
     (b) Pitch-Scaled (β = 0.8)

5.1  Joint Frequency Scaling and Time-Scale Modification
     (a) Original
     (b) Frequency Compression and Time-Scale Expansion
     (c) Inversion of Figure 5.1(b)


SPEECH  TRANSFORMATIONS 
BASED  ON  A  SINUSOIDAL  REPRESENTATION 


1.  INTRODUCTION 

In a number of important applications, it is desirable to transform a speech waveform into a
signal that is more useful than the original [1]. In time-scale modification [2], for example, the
rate of articulation is slowed down to make degraded speech more comprehensible. Alternatively,
speech is sped up in order to scan a passage quickly or to compress an utterance into an
allocated time interval. In other applications, the speech is compressed or expanded in frequency.
In particular, frequency compression is useful in bandwidth reduction [3] or in placing the speech
into a desired frequency range as an aid to the hearing impaired [4]. Another application requires
that the fundamental frequency of the speaker be modified while preserving the shape of the
envelope of the short-time speech spectrum. This operation is useful in psychoacoustic research [5]
or in correcting pitch disjunctions in concatenated speech segments [6]. In a number of these
applications, it is sometimes desired to perform speech modifications that vary in time or to
perform several modifications simultaneously. For example, in time-scale modification it is of
interest to have the means to adjust a speaker’s rate of articulation continuously, while in
concatenating speech segments both the time scale and pitch may require modification.

In this report, a speech analysis/synthesis system is presented which forms the basis of a
general class of such transformations. The system is based on a sinusoidal representation of
speech which incorporates a model of speech production, but which is independent of the speech
state and of the pitch. The reconstruction requires an estimate of the excitation and vocal tract
contributions to the amplitude and phase of each component of the underlying sine-wave model.
Functional representations of these parameters are derived from short-time Fourier transform
samples which correspond to the sine-wave components. The resulting analysis/synthesis system
thus represents a refinement of a purely sine-wave-based analysis/synthesis procedure described in
a previous report [7]. The new system has been applied to obtain high-quality time-scale
modification, frequency scaling, and scaling of fundamental frequency. These operations can be
performed independently of one another or simultaneously and can also be applied with
time-varying changes. For example, a speaker’s pitch can be continuously changed while
continuously changing the rate of articulation. Furthermore, the system does not break down
either for a large class of nonspeech signals or for speech corrupted by interferences such as a
second speaker or acoustic background noise, in the sense that the background is not perceived as
different from the original.

Numerous other methods have been proposed for modification of the speech waveform. One
of the earlier approaches, based on classical vocoders [1, 8], utilizes pitch and voiced/unvoiced
decisions in the excitation and an estimate of the vocal tract system function. Although this
procedure is suitable for a wide range of speech transformations, errors in pitch and
voiced/unvoiced state decisions typically introduce artifacts into the synthetic modified speech.
A more recent approach that does not require pitch extraction and voiced/unvoiced decisions
manipulates an excitation obtained by deconvolving the original speech with a vocal tract spectral
envelope estimate [9]. This procedure thus relies on the speech production model, as do classical
vocoders, but avoids many of the problems inherent in the vocoder approach. Another class of
methods widely used for time-scale modification is based on the Fairbanks method [10].
This technique periodically repeats or discards segments of the speech waveform, a method prone
to boundary discontinuities. Refinements of this technique involve “pitch synchronous” splicing of
the waveform [11]. A further improvement was introduced by Neuberg [12], who smoothly merged
adjacent speech segments to reduce discontinuities. Another approach to guaranteeing a smooth
synthetic waveform from speech segments uses an iterative method for reconstructing the
modified waveform from the short-time Fourier transform magnitude [13, 14]. Although
computationally burdensome for practical applications, this method yields very high-quality
rate-altered speech.

A number of approaches to analysis/synthesis based on sine-wave models have been
discussed in the literature. Hedelin [15] proposed a pitch-independent sine-wave model for use in
coding the baseband signal for speech data-rate compression. The amplitudes and frequencies of
the underlying sine waves are estimated using Kalman filtering techniques and the sine-wave
phases are obtained by integrating the instantaneous frequencies. The use of this system for
speech transformations was not explored. Other high-quality systems based on a sine-wave
representation have been applied to time-scale modification of speech [16, 17] and frequency
scaling [17]. In addition, the system by Portnoff [16], a refinement of the phase vocoder [1],
represents each sine-wave component by vocal cord excitation and vocal tract system contributions.
In contrast to Hedelin’s approach and the approach taken in this report, these analysis/synthesis
systems are based on an underlying representation which constrains the sine-wave components to
be harmonically related. Furthermore, the analysis in these systems does not explicitly estimate
the sine-wave components, but rather views them as outputs of a bank of uniformly spaced
bandpass filters. The synthesis can be viewed as summing the modified outputs of this filter bank.
Another system, which may be applicable to time-scale modification, is also based on a harmonic
sine-wave model and uses sine-wave generators explicitly in the synthesizer [18]. To compensate
for the inadequacies in the harmonic model, a residual waveform is computed which in turn must
be time-scale modified, a problem which has not yet been investigated.

As indicated above, in contrast to earlier sine-wave-based systems, the analysis/synthesis
system of this report explicitly estimates the amplitude and phase of the vocal cord excitation
and vocal tract system function contributions to each sine wave. These estimates are obtained
from the short-time Fourier transform evaluated at frequencies corresponding to the locations of
spectral peaks. Since the frequencies of the sine waves are not constrained to be harmonic, pitch
is not required in the analysis. The synthesis uses the amplitude and phase estimates to obtain a
functional representation of the time evolution of each parameter. This particular functional
representation is the key to all speech transformations and allows for a flexibility (i.e., joint
time-varying transformations) not present in other high-quality systems.

This report is organized as follows. The sinusoidal speech representation upon which the
analysis/synthesis system is based is described in Section 2. Then in Section 3 the problem of
time-scale modification is addressed in two parts. First the time scale is allowed to change
uniformly and then the generalization to a time-varying adjustment of the time scale is
developed. The generalized system is used in attempting to simulate actual rate-changed speech
more closely by adapting the time scale to features of the speech waveform. In Section 4 the
problems of frequency scaling and pitch modification are addressed. As in Section 3, both
uniform and time-varying modifications are considered. Finally, in Section 5 the time and
frequency modifications are combined into a single system capable of performing the operations
jointly.


2. ANALYSIS/SYNTHESIS BASED ON A SINE-WAVE SPEECH MODEL


In this section, the sinusoidal representation of speech presented in Reference 7 is first
reviewed. A new analysis/synthesis system is then developed which refines the analysis/synthesis
procedure in Reference 7 by separating the vocal cord excitation and vocal tract system
contributions underlying each component of the sine-wave model.

2.1  The  Sinusoidal  Representation 

In the speech production model, the speech waveform $s(t)$ is assumed to be the output of
passing a glottal excitation waveform $e(t)$ through a linear time-varying system with impulse
response $h(t, \tau)$, representing the characteristics of the vocal tract. Mathematically, this can be
written as

$$s(t) = \int_0^t h(t, t - \tau)\, e(\tau)\, d\tau \tag{2.1}$$

The  excitation  will  be  represented  by  a  sum  of  sine  waves  of  arbitrary  amplitudes,  frequencies, 
and  phases: 

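$$e(t) = \sum_{\ell=1}^{L(t)} a_\ell(t) \cos[\Omega_\ell(t)] \tag{2.2a}$$

where

$$\Omega_\ell(t) = V_\ell(t) + \phi_\ell \tag{2.2b}$$

with

$$V_\ell(t) = \int_{t_\ell}^{t} \omega_\ell(\sigma)\, d\sigma \tag{2.2c}$$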

where $t_\ell$ is the onset time of the $\ell$th sine wave and where $L(t)$ is the number of sine-wave
components at time $t$. For the $\ell$th component, $a_\ell(t)$ and $\omega_\ell(t)$ are the slowly time-varying
amplitude and frequency (i.e., the parameters are essentially constant over a 20-30 ms analysis
window), $V_\ell(t)$ is the changing contribution to the excitation phase [e.g., for a steady-state tone,
$V_\ell(t)$ is a linear ramp], and $\phi_\ell$ is the fixed phase offset which accounts for the fact that the sine
waves will generally not be in phase. The vocal tract transfer function is given by the Fourier
transform of the system response $h(t, \tau)$ for each $t$ and will be denoted by $H(\omega, t)$:

$$H(\omega, t) = M(\omega, t) \exp[j\Phi(\omega, t)] \tag{2.3}$$

where $M(\omega, t)$ and $\Phi(\omega, t)$ are the amplitude and phase of $H(\omega, t)$, respectively. With these
definitions, the speech production mechanism can be represented in the frequency domain, as
depicted in Figure 2-1. Since the vocal tract system function is linear, each sine-wave component
of the excitation is independently affected by the system function.

Figure 2-1. Sinusoidal representation of speech production.

This approach to modeling the excitation and vocal tract leads to a particularly simple
representation of the speech waveform. Let the amplitude and phase of the system function along
each frequency track $\omega_\ell(t)$ of the excitation be denoted by

$$M_\ell(t) = M[\omega_\ell(t), t] \tag{2.4a}$$

and

$$\Phi_\ell(t) = \Phi[\omega_\ell(t), t] \tag{2.4b}$$

Then using (2.1) through (2.4) results in the sinusoidal representation [7]

$$s(t) = \sum_{\ell=1}^{L(t)} A_\ell(t) \cos[\theta_\ell(t)] \tag{2.5a}$$

where

$$A_\ell(t) = a_\ell(t) M_\ell(t) \tag{2.5b}$$

and

$$\theta_\ell(t) = \Omega_\ell(t) + \Phi_\ell(t) = V_\ell(t) + \phi_\ell + \Phi_\ell(t) \tag{2.5c}$$

represent the amplitude and phase of each sine-wave component along the frequency track $\omega_\ell(t)$.
The accuracy of this representation is subject to the caveat that the parameters are slowly varying
relative to the duration of the vocal tract impulse response.
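
As an illustration of (2.5a) through (2.5c), the following minimal Python sketch evaluates the
sine-wave sum on a uniform sample grid. It is not the implementation used in this report: the
function name `synthesize`, the per-sample track layout, the sampling rate `fs`, and the
rectangular-rule approximation of the integral in (2.2c) are all assumptions made for the example.

```python
import math

def synthesize(a, M, freq_hz, phi, Phi, fs):
    """Evaluate s(n) = sum_l a_l(n) M_l(n) cos(V_l(n) + phi_l + Phi_l(n)).

    a, M, freq_hz, Phi: one per-sample list per sine-wave component.
    phi: one fixed phase offset per component. fs: sampling rate in Hz.
    """
    n_samples = len(a[0])
    s = [0.0] * n_samples
    for l in range(len(a)):
        V = 0.0  # running integral of the instantaneous frequency, as in (2.2c)
        for n in range(n_samples):
            theta = V + phi[l] + Phi[l][n]               # composite phase (2.5c)
            s[n] += a[l][n] * M[l][n] * math.cos(theta)  # amplitude (2.5b), sum (2.5a)
            V += 2.0 * math.pi * freq_hz[l][n] / fs      # rectangular-rule integration
    return s

# One steady 100 Hz component: the excitation phase V is then a linear ramp.
N, fs = 81, 8000.0
tone = synthesize([[1.0] * N], [[2.0] * N], [[100.0] * N], [0.0], [[0.0] * N], fs)
```

With a constant excitation amplitude of 1 and system amplitude of 2, the output is a 100 Hz
cosine of amplitude 2, which repeats every 80 samples at the assumed 8 kHz sampling rate.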

2.2  Analysis 

The objective of the analysis is to estimate the model parameters (2.5) at an analysis frame
rate sufficient to track articulatory changes, typically one frame every 5-20 ms. The procedure
begins with estimating from a high-resolution spectral analysis the frequencies $\omega_\ell(t)$ and the
composite amplitudes $A_\ell(t)$ and phases $\theta_\ell(t)$ at the analysis frame rate. The second step in
analysis separates the system and excitation components of (2.5b) and (2.5c).

Let $x(n)$ represent samples of the speech waveform, $w(n)$ the analysis window, and $R$ the
analysis frame interval in samples, where the time sampling interval is assumed normalized to
unity. Since the measured speech waveform is to be processed digitally, sampled-data notation is
used throughout most of this section. The frequencies of the glottal excitation $e(n)$ in (2.2) at
time $kR$, associated with the $k$th analysis frame, are chosen to correspond to the $L(kR)$ largest
peaks in the magnitude of the short-time Fourier transform, $|X(\omega, kR)|$, where

$$X(\omega, kR) = \sum_m w(kR - m)\, x(m) \exp(-jm\omega) \tag{2.6}$$

is the Fourier transform of the windowed speech segment $w(kR - n)x(n)$. The window sequence
$w(n)$ is Hamming in shape and is nonzero over a range $0 \le n < N$, typically corresponding to
20 to 30 ms. In practice, the window duration is adaptive, being set to 2.5 times the
speaker’s measured average pitch period. A minimum window width of 20 ms is used to guarantee
adequate representation of unvoiced speech. Although ideally provision might be made for
making the window duration a function of the instantaneous pitch, this refinement was found
unnecessary for high-quality synthetic speech. The number of peaks $L(kR)$ is typically about 40
to 60 over a 4 kHz range. The maximum number of peaks that can be specified is limited by a
threshold that is also a function of the measured average pitch. In particular, the maximum
number of peaks was set at 60 for a low-pitch speaker (60-100 Hz), 50 for a
medium-pitch speaker (100-200 Hz), and 40 for a high-pitch speaker (200-300 Hz). In
general, the performance was affected by the choice of this threshold only when too few peaks
were allowed. The locations of the largest peaks were estimated by simply searching for a change
of slope from positive to negative in the uniformly spaced samples of the short-time Fourier
transform magnitude computed using the Discrete Fourier Transform (DFT). In practice, the
DFT was evaluated using a 512-point Fast Fourier Transform (FFT), which gave adequate
frequency resolution.

The amplitudes and phases (modulo $2\pi$) of the component sine waves are given by the
appropriate samples of the high-resolution DFT corresponding to $X(\omega, kR)$ at the chosen
frequencies. Specifically, if $\hat{\omega}_\ell^k$ is the $\ell$th frequency estimate on the $k$th analysis frame, i.e.,

$$\hat{\omega}_\ell^k = \hat{\omega}_\ell(kR) \tag{2.7}$$

where “^” denotes estimate, then the corresponding estimated amplitudes and phases, denoted by
$\hat{A}_\ell^k$ and $\hat{\theta}_\ell^k$, respectively, are given as

$$\hat{A}_\ell^k = |X(\hat{\omega}_\ell^k, kR)| \tag{2.8a}$$

and

$$\hat{\theta}_\ell^k = \arg[X(\hat{\omega}_\ell^k, kR)] \tag{2.8b}$$

where “arg” denotes principal phase value. In Reference 7 this estimator was shown to have
certain robustness properties.
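
The peak-picking and parameter-estimation steps described above can be sketched as follows.
This is a minimal illustration under stated assumptions, not the report's implementation: the
helper names (`hamming`, `dft_half`, `pick_peaks`, `analyze_frame`) are hypothetical, the DFT
is evaluated directly rather than with an FFT, the phase is referenced to the first sample of
the frame, and the amplitudes are left unnormalized by the window gain.

```python
import cmath
import math

def hamming(length):
    """Hamming window of the given length."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]

def dft_half(frame, n_fft=512):
    """Zero-padded DFT of `frame` at bins 0..n_fft/2 (direct evaluation)."""
    tw = [cmath.exp(-2j * math.pi * i / n_fft) for i in range(n_fft)]
    return [sum(x * tw[(k * m) % n_fft] for m, x in enumerate(frame))
            for k in range(n_fft // 2 + 1)]

def pick_peaks(mag, max_peaks):
    """Bins where the slope of `mag` changes from positive to negative,
    limited to the `max_peaks` largest and returned in frequency order."""
    peaks = [k for k in range(1, len(mag) - 1)
             if mag[k - 1] < mag[k] >= mag[k + 1]]
    peaks.sort(key=lambda k: mag[k], reverse=True)
    return sorted(peaks[:max_peaks])

def analyze_frame(x, max_peaks, n_fft=512):
    """Return (bin, amplitude, phase) per peak, following (2.8a) and (2.8b)."""
    w = hamming(len(x))
    X = dft_half([wi * xi for wi, xi in zip(w, x)], n_fft)
    bins = pick_peaks([abs(v) for v in X], max_peaks)
    return [(k, abs(X[k]), cmath.phase(X[k])) for k in bins]

# Two steady sine waves placed exactly on DFT bins 32 and 80 (a contrived test signal).
frame = [math.cos(2 * math.pi * 32 * n / 512 + 0.7)
         + 0.5 * math.cos(2 * math.pi * 80 * n / 512) for n in range(200)]
result = analyze_frame(frame, max_peaks=2)
```

For this contrived input the two selected bins are those of the two sine waves, and the phase
measured at the stronger peak is close to the 0.7 rad offset used to generate it; the small
residual error comes from window sidelobe leakage.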

The next step in the analysis is to decompose the measured amplitudes $\hat{A}_\ell^k$ and phases $\hat{\theta}_\ell^k$
into vocal tract and excitation components whose amplitudes and phases are combined as in
(2.5b) and (2.5c), respectively. The approach is first to obtain, at each analysis frame, estimates of
the vocal tract amplitude and phase as functions of frequency, i.e., $\hat{M}(\omega, kR)$ and $\hat{\Phi}(\omega, kR)$.
(In practice, of course, only uniform samples from the DFT are available.) It will be shown below
that the method of homomorphic deconvolution is particularly appropriate for obtaining these
estimates in the context of the sine-wave representation. Assuming that the functions of frequency
$\hat{M}(\omega, kR)$ and $\hat{\Phi}(\omega, kR)$ have been determined, the system amplitude and phase estimates at
the selected frequencies are given by:

$$\hat{M}_\ell^k = \hat{M}(\hat{\omega}_\ell^k, kR) \tag{2.9a}$$

and

$$\hat{\Phi}_\ell^k = \hat{\Phi}(\hat{\omega}_\ell^k, kR) \tag{2.9b}$$

Finally, the excitation parameter estimates at each analysis frame boundary are obtained as

$$\hat{a}_\ell^k = \hat{A}_\ell^k / \hat{M}_\ell^k \tag{2.10a}$$

and

$$\hat{\Omega}_\ell^k = \hat{\theta}_\ell^k - \hat{\Phi}_\ell^k \tag{2.10b}$$

A block diagram of the primary steps in the analysis is depicted in Figure 2-2. The dotted lines
stemming from the frequency estimator indicate that the frequency selection, amplitude division,
and phase subtraction in (2.9) and (2.10) are performed at only the estimated frequencies $\hat{\omega}_\ell^k$.

The remaining problem is to estimate $\hat{M}(\omega, kR)$ and $\hat{\Phi}(\omega, kR)$ as functions of frequency
from the high-resolution short-time Fourier transform $X(\omega, kR)$. There exist a number of
established ways for separating out the system magnitude from the high-resolution spectrum, such
as all-pole modeling [19] and homomorphic deconvolution [20]. The phase separation problem, on
the other hand, is less well understood and thus more difficult to solve [21]. However, if the vocal
tract transfer function is assumed to be minimum phase, then the logarithm of the system magnitude
and the system phase form a Hilbert transform pair. With this constraint, a phase estimate
$\hat{\Phi}(\omega, kR)$ can thus be derived from the logarithm of a magnitude estimate $\hat{M}(\omega, kR)$ through the
Hilbert transform [20]. Furthermore, the resulting phase estimate will be smooth and unwrapped as
a function of frequency, a property that will be useful in performing speech synthesis. Although
this minimum phase condition considerably simplifies the problem, the condition holds only
approximately since the vocal tract transfer function may contain zeros outside the unit circle in
the z-plane.

It follows that homomorphic deconvolution is particularly well suited to the above estimation
problem, since an estimate of the system amplitude from the high-resolution spectrum and the
computation of the Hilbert transform from this amplitude estimate can be performed
simultaneously in this technique [20]. The Fourier transform of the logarithm of the high-resolution
magnitude is first computed to obtain the “cepstrum”. In order to remove the effects due to
pitch, a right-sided window with duration proportional to the average pitch period is then
applied. The imaginary component of the inverse Fourier transform of the resulting sequence is
the desired phase and the real part is the smooth log-magnitude. In practice, uniformly spaced
samples of the Fourier transform are computed with the FFT, whose length was chosen to be
512 points, which was sufficiently large to avoid aliasing in the cepstrum. Thus, as illustrated in
Figure 2-2, the high-resolution spectrum used to estimate the sine-wave frequencies is also used to
estimate the vocal-tract system function.
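
The homomorphic step and the separation in (2.10) can be sketched as follows. This is a minimal
illustration under stated assumptions rather than the report's implementation: the function
names are hypothetical, the cutoff `n_keep` stands in for the pitch-dependent window duration,
the right-sided window is taken to weight the zero-quefrency term by 1 and the retained positive
quefrencies by 2 (a common choice for minimum-phase reconstruction), the cepstrum is obtained with
an inverse DFT and the log spectrum with a forward DFT, and a direct DFT replaces the FFT.

```python
import cmath
import math

def dft(seq, sign):
    """Direct DFT (sign=-1) or unnormalized inverse DFT (sign=+1)."""
    n = len(seq)
    tw = [cmath.exp(sign * 2j * math.pi * i / n) for i in range(n)]
    return [sum(seq[m] * tw[(k * m) % n] for m in range(n)) for k in range(n)]

def min_phase_system(mag, n_keep):
    """Smoothed log-magnitude and minimum-phase estimate per DFT bin.

    mag: spectral magnitude at all n_fft uniformly spaced bins.
    n_keep: number of cepstral samples retained (at most n_fft / 2).
    """
    n = len(mag)
    log_mag = [math.log(max(m, 1e-12)) for m in mag]  # floor avoids log(0)
    cep = [c.real / n for c in dft(log_mag, +1)]      # real cepstrum
    lifter = [1.0 if i == 0 else (2.0 if i < n_keep else 0.0) for i in range(n)]
    spec = dft([c * w for c, w in zip(cep, lifter)], -1)
    return [s.real for s in spec], [s.imag for s in spec]

def split_excitation(A, theta, log_M, Phi, bins):
    """Apply (2.10a) and (2.10b) at the selected peak bins."""
    a = [A_l / math.exp(log_M[k]) for A_l, k in zip(A, bins)]
    Omega = [th - Phi[k] for th, k in zip(theta, bins)]
    return a, Omega

# Check against a known minimum-phase system H(z) = 1 / (1 - 0.5 z^-1).
n_fft = 64
mag = [1.0 / abs(1.0 - 0.5 * cmath.exp(-2j * math.pi * k / n_fft))
       for k in range(n_fft)]
log_M, Phi = min_phase_system(mag, n_fft // 2)
a, Omega = split_excitation([2.0 * mag[16]], [0.3 + Phi[16]], log_M, Phi, [16])
```

For the one-pole test system the recovered phase at a quarter of the sampling rate matches the
analytic value of minus arctan(0.5), and dividing out the system amplitude returns the assumed
excitation amplitude of 2 and excitation phase of 0.3 rad.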


Figure 2-2. Block diagram of sinusoidal analysis.


2.3 Synthesis

In  speech  synthesis,  the  goal  is  to  reconstruct  an  approximation  that  is  as  “close  as  possible” 
to  the  original  speech.  In  the  context  of  the  sine-wave  representation  of  the  speech  production 
mechanism,  the  synthesis  first  requires  the  matching  of  samples  of  the  excitation  and  vocal  tract 
contributions  of  each  sine  wave  computed  at  consecutive  frame  boundaries.  The  matching 
procedure  is  followed  by  interpolation  of  the  resulting  pairs  of  amplitude  and  phase  samples  of 
the  excitation  and  vocal  tract  functions  and,  lastly,  the  generation  of  sine  waves  based  on  the 
interpolated  components. 

The first step can be accomplished by associating the excitation frequencies measured on one
frame with those obtained on a successive frame. Since the excitation and system amplitudes and
phases are specified at the excitation frequencies, the matching of these parameters over
consecutive frames follows directly. An algorithm for matching the locations of spectral peaks was
proposed for the synthesis procedure in Reference 7, which uses a purely sine-wave-based
model (i.e., the excitation and system contributions of each sine-wave component are not
explicitly represented). Since the matching requirements are similar, the matching
algorithm in Reference 7 can also be used here. The essence of the procedure is a
nearest-neighbor association of frequencies. In practice, however, the locations of spectral peaks,
and thus of the frequencies, will change as the pitch changes, and there will be rapid changes in
both the location and number of peaks in rapidly varying regions of speech, due to
voiced/unvoiced transitions and to voiced fricatives, for example. Since the nearest-neighbor
association of frequencies is not sufficient to account for such rapid spectral movements, the
frequency matching also incorporates a birth-death process of the component sine waves [7]. As a
result of the matching algorithm, all of the amplitudes and phases of the excitation and system
components measured for an arbitrary frame $k$ at frequencies $\hat{\omega}_\ell^k$ are associated with a
corresponding set of parameters for frame $k + 1$.
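
The idea of nearest-neighbor matching with births and deaths can be illustrated by the
simplified sketch below. It is not the algorithm of Reference 7, whose details are not
reproduced in this section: the greedy closest-pair-first strategy, the function name
`match_frequencies`, and the threshold `max_jump` are assumptions chosen for the example.

```python
def match_frequencies(freqs_k, freqs_k1, max_jump):
    """Greedy nearest-neighbor association between two frames.

    Returns (matches, deaths, births): matched index pairs (i, j), indices
    of frame-k tracks with no continuation, and indices of frame-(k+1)
    peaks that start a new track.
    """
    candidates = sorted(
        (abs(f0 - f1), i, j)
        for i, f0 in enumerate(freqs_k)
        for j, f1 in enumerate(freqs_k1)
        if abs(f0 - f1) <= max_jump
    )
    used_i, used_j, matches = set(), set(), []
    for _, i, j in candidates:  # closest pairs are claimed first
        if i not in used_i and j not in used_j:
            matches.append((i, j))
            used_i.add(i)
            used_j.add(j)
    deaths = [i for i in range(len(freqs_k)) if i not in used_i]
    births = [j for j in range(len(freqs_k1)) if j not in used_j]
    return sorted(matches), deaths, births

# Frequencies in Hz on two successive frames (made-up values).
matches, deaths, births = match_frequencies(
    [100.0, 200.0, 300.0], [105.0, 310.0, 1000.0], max_jump=20.0)
```

In this made-up example the 100 Hz and 300 Hz tracks continue to 105 Hz and 310 Hz, the
200 Hz track dies, and the 1000 Hz peak is born as a new track.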

The next step in the synthesis is to interpolate the matched excitation and system parameters
over a frame duration. The interpolation procedures are based on the assumption that the
excitation and system functions are “slowly-varying” across each frame along frequency tracks
$\omega_\ell(t)$. This slowly-varying constraint implies a slowly-varying excitation and system
amplitude, and hence it suffices to interpolate samples of these functions linearly over a frame
duration. Letting $\hat{M}_\ell^k$ and $\hat{M}_\ell^{k+1}$ denote a successive pair of system amplitude estimates for the
$\ell$th frequency track, the system amplitude estimate across the $k$th frame is given by

$$\hat{M}_\ell(t) = \hat{M}_\ell^k + (\hat{M}_\ell^{k+1} - \hat{M}_\ell^k)\, t/T \tag{2.11a}$$

where $T$ is the frame duration and $0 \le t \le T$ is the time into the $k$th frame. (Note that for
simplicity the $k$ dependence of $\hat{M}_\ell(t)$ has not been made explicit. Throughout the remainder of
the report, the dependence of functional estimates on the frame number $k$ is generally assumed.)
Likewise, the excitation amplitude estimate $\hat{a}_\ell(t)$ over the $k$th frame is given by

$$\hat{a}_\ell(t) = \hat{a}_\ell^k + (\hat{a}_\ell^{k+1} - \hat{a}_\ell^k)\, t/T \tag{2.11b}$$

where $\hat{a}_\ell^k$ and $\hat{a}_\ell^{k+1}$ denote a successive pair of excitation amplitude estimates. Since the
analysis and synthesis are performed digitally, in practice only samples of the continuous-time
functions $\hat{M}_\ell(t)$ and $\hat{a}_\ell(t)$ are computed in obtaining a discrete-time realization of the
synthetic waveform. Nevertheless, the continuous-time functional representation (2.11) is given
here since it provides the key to performing time-scale modification in Section 3.
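
A discrete-time realization of (2.11) along one frequency track can be sketched as below. The
function names and the choice of sampling each frame at t = 0, 1, ..., T - 1 (so that each frame
boundary value is emitted once) are assumptions made for the illustration.

```python
def interpolate_linear(v_k, v_k1, frame_len):
    """Samples of v(t) = v_k + (v_k1 - v_k) t / T at t = 0, 1, ..., T - 1."""
    return [v_k + (v_k1 - v_k) * n / frame_len for n in range(frame_len)]

def interpolate_track(frame_values, frame_len):
    """Concatenate the per-frame interpolations along one frequency track."""
    out = []
    for v_k, v_k1 in zip(frame_values, frame_values[1:]):
        out.extend(interpolate_linear(v_k, v_k1, frame_len))
    return out

# System amplitude estimates at three frame boundaries, four samples per frame.
M_track = interpolate_track([1.0, 2.0, 0.0], frame_len=4)
```

The same routine applies unchanged to the excitation amplitude samples in (2.11b).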

Since  the  vocal  tract  system  is  assumed  slowly-varying  over  consecutive  frames,  it  is 
reasonable  to  assume  that  its  phase  is  slowly-varying  as  well  and  thus  linear  interpolation  of  the 
system  phase  samples  will  also  suffice.  However,  the  characteristic  of  “slowly-varying”  is  more 
difficult  to  achieve  for  the  system  phase  than  for  the  system  magnitude.  This  is  because  an 
additional  constraint  must  be  imposed  on  the  measured  phase;  namely  that  the  phase  be  smooth 
and  unwrapped  as  a  function  of  frequency  at  each  frame  boundary.  This  requirement  is 
illustrated in Figure 2-3. There it is shown that if the system phase is obtained modulo $2\pi$, then
linear interpolation can result in a falsely rapidly-varying system phase between frame boundaries.


Figure 2-3. Requirement of unwrapped system phase. (a) Interpolation of wrapped system phase.
(b) Interpolation of unwrapped system phase.
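
The effect can be reproduced numerically. The sketch below uses hypothetical phase values at a single frequency on two consecutive frames; the helper names `principal_value` and `lerp` are illustrative assumptions.

```python
import math


def principal_value(phase):
    """Wrap a phase in radians into the interval [-pi, pi)."""
    return (phase + math.pi) % (2.0 * math.pi) - math.pi


def lerp(start, end, frac):
    """Linear interpolation between start and end at fraction frac in [0, 1]."""
    return start + (end - start) * frac


# Hypothetical unwrapped system phase at frames k and k+1 (slowly varying).
true_k, true_k1 = 3.0, 3.4

# The same phases as they would be measured modulo 2*pi.
wrapped_k = principal_value(true_k)     # stays near +3.0
wrapped_k1 = principal_value(true_k1)   # jumps to about -2.88

mid_unwrapped = lerp(true_k, true_k1, 0.5)     # midway between 3.0 and 3.4
mid_wrapped = lerp(wrapped_k, wrapped_k1, 0.5) # roughly pi away from the above
```

The wrapped interpolation sweeps through almost a full cycle within one frame even though the underlying phase changed by only 0.4 rad.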


The  importance  of  the  use  of  a  homomorphic  analyzer  in  the  previous  section  is  now  evident. 

The system phase estimate $\hat{\Phi}(\omega, kR)$ derived from the homomorphic analysis is unwrapped in
frequency and thus is slowly-varying when the system amplitude (from which it was derived) is
slowly-varying. Linear interpolation of samples of this function then results in a phase trajectory
which reflects the underlying vocal tract movement. The interpolation scheme is similar to that in
(2.11), and is given by

$$\hat{\Phi}_\ell(t) = \hat{\Phi}_\ell^{k} + (\hat{\Phi}_\ell^{k+1} - \hat{\Phi}_\ell^{k})\,t/T \qquad (2.12)$$

where as before $0 \le t \le T$ is the time into the kth frame and where in a discrete-time realization
only samples of (2.12) are required.

Unfortunately,  such  a  simple  approach  cannot  be  used  to  interpolate  the  phase  and 
frequency of the excitation. Since the phase of $X(\omega, kR)$ in (2.8b) is measured modulo $2\pi$,
the excitation phase $\hat{\Omega}_\ell^{k}$ in (2.10b) may contain $2\pi$ discontinuities. Thus in interpolating the
excitation phase, phase unwrapping must be performed. In addition, since the excitation phase is
the integral of the instantaneous frequency as seen in (2.2b), the interpolation must yield a phase
which is consistent with the frequencies measured at each frame boundary. This problem, which
was originally addressed in Reference 7, was solved by using a cubic polynomial for the
interpolation function, namely

$$\hat{\Omega}_\ell(t) = a + bt + ct^2 + dt^3 \qquad (2.13)$$

with t = 0 corresponding to frame k and t = T corresponding to frame k + 1. Thus as before t
represents the time into the kth frame. Since the instantaneous frequency is the derivative of the
phase, then

$$\hat{\omega}_\ell(t) = \dot{\hat{\Omega}}_\ell(t) = b + 2ct + 3dt^2 . \qquad (2.14)$$

The  solution  requires  constraining  the  cubic  function  (2.13)  and  its  derivative  (2.14)  to  equal  the 
excitation  phase  and  frequencies  measured  at  the  frame  boundaries.  The  notion  of  applying  a 
cubic  polynomial  to  interpolate  the  excitation  phase  between  frame  boundaries  was  independently 
proposed by Almeida and Silva (Reference 18) for use in their harmonic sine-wave synthesizer. However, since
only  the  principal  value  of  the  phase  can  be  measured,  provision  must  be  made  for  unwrapping 
the  phase  subject  to  the  above  constraints  on  the  cubic  phase  interpolation  function.  This  leads 
to  invoking  an  additional  constraint  which  requires  that  the  unwrapped  phase  be  maximally 
smooth. The criterion of “smoothness” is defined as the minimization of the second derivative
of $\hat{\Omega}_\ell(t)$ over the analysis frame duration. This approach is similar to the method for
interpolating samples of the estimated composite phase $\hat{\theta}_\ell(t)$ that was developed in Reference 7.
The resulting phase function not only satisfies all the endpoint constraints, but also resolves any
$2\pi$ phase ambiguities, thus unwrapping the excitation phase in time along each frequency track.
Note  that  since  the  excitation  frequency  equals  the  phase  derivative,  a  quadratic  frequency 
trajectory  can  be  computed  directly  from  this  procedure. 
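
A minimal sketch of such a cubic interpolator follows. The closed-form expressions for the unwrapping integer M and the coefficients c and d are the ones usually quoted for the maximally smooth cubic scheme; they are not spelled out in the text above, so they should be read as an assumption here, as should the function names and the numerical values.

```python
import math


def cubic_phase_coeffs(theta0, omega0, theta1, omega1, T):
    """Coefficients (a, b, c, d) of (2.13) plus the unwrapping integer M.

    theta0, omega0: phase and frequency measured at frame k     (t = 0)
    theta1, omega1: phase and frequency measured at frame k + 1 (t = T)
    theta1 is a principal value; M adds the 2*pi multiple that gives the
    smoothest trajectory (assumed closed form, see lead-in).
    """
    x = (theta0 + omega0 * T - theta1) + (omega1 - omega0) * T / 2.0
    M = round(x / (2.0 * math.pi))
    delta = theta1 + 2.0 * math.pi * M - theta0 - omega0 * T
    c = 3.0 * delta / T ** 2 - (omega1 - omega0) / T
    d = -2.0 * delta / T ** 3 + (omega1 - omega0) / T ** 2
    return theta0, omega0, c, d, M


def phase(coeffs, t):
    """Evaluate the cubic phase (2.13) at time t into the frame."""
    a, b, c, d, _ = coeffs
    return a + b * t + c * t * t + d * t ** 3


def frequency(coeffs, t):
    """Evaluate the quadratic frequency (2.14) at time t into the frame."""
    _, b, c, d, _ = coeffs
    return b + 2.0 * c * t + 3.0 * d * t * t


# Hypothetical measurements: radians and radians/sample, frame of 50 samples.
T = 50.0
coeffs = cubic_phase_coeffs(1.0, 0.30, -2.0, 0.32, T)
M = coeffs[4]
```

By construction the cubic matches the measured phase (up to the added multiple of $2\pi$) and the measured frequency at both frame boundaries.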

Since  the  above  analysis  began  with  the  assumption  of  initial  estimates  of  the  excitation  and 
the  system  amplitudes  and  phases  corresponding  to  the  start  of  frame  k,  it  is  necessary  to  specify 
the initialization of the frame interpolation procedure. At the birth of a track, since a matched
peak was not found at frame k, the frequency between frame k and frame k + 1 is assumed fixed
at the measured frequency $\hat{\omega}_\ell^{k+1}$. The system amplitude and phase are simply set to the values
measured on frame k at the frequency $\hat{\omega}_\ell^{k+1}$, i.e., $\hat{M}(\hat{\omega}_\ell^{k+1}, kR)$ and $\hat{\Phi}(\hat{\omega}_\ell^{k+1}, kR)$. On the other
hand, the excitation amplitude is set to zero to avoid unnatural discontinuities, since upon the
birth of a track the measured excitation function at $\hat{\omega}_\ell^{k+1}$ on frame k may be arbitrary. To
initiate the interpolation of the excitation phase at the birth of a track, the phase $\hat{\Omega}_\ell^{k+1}$ is defined
to be the measured phase and the startup phase on frame k is defined to be

$$\hat{\Omega}_\ell^{k} = \hat{\Omega}_\ell^{k+1} - \hat{\omega}_\ell^{k+1} R \qquad (2.15)$$

where R is the number of samples traversed in going from frame k + 1 back to frame k (a
discrete-time realization is assumed here) and where the frequency over frame k + 1 is assumed
fixed at $\hat{\omega}_\ell^{k+1}$. This procedure ensures that the phase interpolation constraints are satisfied
initially. Note also that $\hat{\Omega}_\ell^{k}$ provides an estimate of the phase offset $\phi_\ell$ of (2.2b). Likewise the
startup time kR gives an estimate of the sine-wave onset time $t_\ell$ of the $\ell$th sine wave. This
estimate of the onset time always falls on a frame boundary.
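
The startup rule can be checked with a small sketch. The function name `birth_startup_phase` and the numerical values are hypothetical; the rule itself is simply a backward extrapolation of the measured phase at constant frequency over one frame.

```python
def birth_startup_phase(phase_next, freq_next, R):
    """Startup phase on frame k for a track born at frame k + 1.

    The phase measured at frame k + 1 is run backward over R samples at the
    fixed frequency freq_next (radians per sample).
    """
    return phase_next - freq_next * R


# Hypothetical measurements at frame k + 1.
phase_next = 0.7   # measured excitation phase (radians)
freq_next = 0.2    # measured frequency (radians/sample)
R = 50             # samples per frame

phase_start = birth_startup_phase(phase_next, freq_next, R)

# Advancing from the startup phase at the fixed frequency returns the
# measured phase at frame k + 1, so the endpoint constraints hold.
phase_end = phase_start + freq_next * R
```

With the excitation amplitude interpolated up from zero over the same frame, the new sine wave fades in without a discontinuity.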


It  was  noted  earlier  in  presenting  the  sine-wave  model  that  the  excitation  phase  consists  of 
two  components,  a  constant  term  and  a  time-varying  term  as  given  in  (2.2)  where  the  time  t  is 
continuously  running.  Similarly,  the  estimate  of  the  excitation  phase  over  the  kth  frame  can  be 
written  in  terms  of  a  constant  and  a  time-varying  term.  Although  this  decomposition  of  the 
excitation  phase  is  not  necessary  for  signal  synthesis,  it  is  developed  here  since  it  will  be  used  in 
later  sections  on  speech  modifications.  Specifically,  the  excitation  phase  over  the  kth  frame  is 
written as

$$\hat{\Omega}_\ell(t) = \int_{\hat{t}_\ell}^{t} \hat{\omega}_\ell(\sigma)\,d\sigma + \hat{\phi}_\ell
= \int_{\hat{t}_\ell}^{0} \hat{\omega}_\ell(\sigma)\,d\sigma + \int_{0}^{t} \hat{\omega}_\ell(\sigma)\,d\sigma + \hat{\phi}_\ell \qquad (2.16)$$

where t falls in the range [0, T] which defines the kth frame. Letting $\hat{\Sigma}_\ell^{k}$ denote the phase due to
the time-varying frequency accumulated up to frame k, i.e.,

$$\hat{\Sigma}_\ell^{k} = \int_{\hat{t}_\ell}^{0} \hat{\omega}_\ell(\sigma)\,d\sigma \qquad (2.17a)$$

and letting $\hat{V}_\ell(t)$ denote the phase due to the time-varying frequency accumulated over frame k, i.e.,

$$\hat{V}_\ell(t) = \int_{0}^{t} \hat{\omega}_\ell(\sigma)\,d\sigma \qquad (2.17b)$$

then the excitation phase in (2.16) can be written as

$$\hat{\Omega}_\ell(t) = \hat{V}_\ell(t) + \hat{\Sigma}_\ell^{k} + \hat{\phi}_\ell \qquad (2.18)$$

The time-varying component $\hat{V}_\ell(t)$ in (2.18) is obtained by integrating the instantaneous
frequency, the result of which is given by the time-varying component of the cubic (2.13). The
constant component consists of the phase offset estimate $\hat{\phi}_\ell$ and the accumulated phase
component $\hat{\Sigma}_\ell^{k}$, which from (2.17a) and (2.17b) can be obtained recursively as

$$\hat{\Sigma}_\ell^{k+1} = \hat{\Sigma}_\ell^{k} + \hat{V}_\ell(T) \qquad (2.19)$$

The constant component is also the phase at the right-hand boundary of the previous frame and
is given by the value of the parameter a in (2.13). In the case where the track is just initiated, the
accumulated phase $\hat{\Sigma}_\ell^{k} = 0$ and the constant $a = \hat{\phi}_\ell = \hat{\Omega}_\ell^{k}$ of (2.15).

The final synthetic waveform (in discrete time) is given by

$$\hat{s}(n) = \sum_{\ell=1}^{L(n)} \hat{A}_\ell(n) \cos[\hat{\theta}_\ell(n)] \qquad (2.20a)$$

where

$$\hat{A}_\ell(n) = \hat{a}_\ell(n)\,\hat{M}_\ell(n) \qquad (2.20b)$$

and

$$\hat{\theta}_\ell(n) = \hat{\Omega}_\ell(n) + \hat{\Phi}_\ell(n) \qquad (2.20c)$$

where L(n) is the number of sine waves estimated at time n and where, since the functional
estimates in (2.20) were derived above on a frame-by-frame basis, the index n = 0, 1, 2, ..., R - 1
is interpreted as the discrete time into the kth frame. The functions $\hat{a}_\ell(n)$, $\hat{M}_\ell(n)$, $\hat{\Omega}_\ell(n)$, and
$\hat{\Phi}_\ell(n)$ come from samples of the continuous-time functions in (2.11) through (2.13). A block
diagram  of  the  overall  synthesis  structure  is  illustrated  in  Figure  2-4.  The  dotted  lines  stemming 
from  the  frequency  matcher  indicate  that  the  matched  frequencies  are  required  by  the  linear  and 
cubic  interpolation  procedures.  Note  that  the  functional  contributions  in  (2.20)  are  estimated 
using  only  two  consecutive  frames.  Thus  in  a  computer  implementation  of  (2.20)  the  speech 
waveform  can  be  processed  in  block  fashion,  requiring  the  storage  of  only  one  high-resolution 
short-time  Fourier  transform  and  two  sets  (corresponding  to  two  consecutive  frames)  of  system 
and  excitation  parameters  in  (2.9)  and  (2.10). 
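
The summation in (2.20) over one frame can be sketched as follows. The data layout (a list of per-track dictionaries holding already-interpolated samples) and the function name `synthesize_frame` are assumptions made for illustration only.

```python
import math


def synthesize_frame(tracks, R):
    """Sum of sine waves over one frame of R samples, following (2.20).

    Each track is a dict of per-sample lists of length R:
      'a'     excitation amplitude,  'M'   system amplitude,
      'Omega' excitation phase,      'Phi' system phase (radians).
    """
    out = []
    for n in range(R):
        sample = 0.0
        for tr in tracks:
            amp = tr["a"][n] * tr["M"][n]          # composite amplitude (2.20b)
            theta = tr["Omega"][n] + tr["Phi"][n]  # composite phase (2.20c)
            sample += amp * math.cos(theta)
        out.append(sample)
    return out


# One hypothetical track: constant amplitudes, linear excitation phase.
R = 4
tracks = [{
    "a": [1.0] * R,
    "M": [2.0] * R,
    "Omega": [0.5 * n for n in range(R)],
    "Phi": [0.0] * R,
}]
frame = synthesize_frame(tracks, R)
```

Only the parameter samples for the current frame are needed, which is what allows the block-by-block processing described above.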

Figure 2-4. Block diagram of sinusoidal synthesis.

In order to evaluate the performance of this new approach to analysis and synthesis of
speech, a non-real-time floating-point computer simulation of this system was developed. The
speech processed in the simulation was low-pass filtered at 5 kHz, digitized at 10 kHz,
analyzed at a 5 ms frame interval and synthesized over a 4 kHz range. A 10 ms analysis frame,
however, was also found adequate for reconstruction. Informal listening demonstrated that for
both male and female speakers, the synthetic speech was nearly perceptually indistinguishable
from the original. As illustrated in Figure 2-5, the original waveform structure is also essentially
preserved in the reconstruction. In this example, representing a segment from a male speaker, the
duration of the analysis window was 25 ms and a threshold of 50 peaks was specified.

Although the system was originally designed for single-speaker signals, the reconstruction
does not break down for multiple speakers nor for nonspeech sounds such as music and marine
biologic sounds. Figure 2-6, for example, depicts the reconstruction of a waveform consisting of
the sum of two speech signals, one from a male and one from a female speaker. Furthermore, in
the presence of acoustic background noise (down to 0 dB S/N) and other interferences such as
music, the reconstructed speech and interference are virtually perceptually identical to the
originals. A segment of the reconstruction for the case of a female speaker in a music background
is illustrated in Figure 2-7. In these different cases, where appropriate, the number of peaks is
assumed equal to the sum of the required number of peaks for each signal type. For example, for
the female speaker in music, the total number of peaks was set at about 80 over a 4 kHz range,
40 peaks for each signal type.

Figure 2-5. Reconstruction of signal from single speaker. (a) Original. (b) Reconstruction.

Figure 2-6. Reconstruction of signal from two speakers. (a) Original. (b) Reconstruction. (Time scale bar: 20 ms.)

Figure 2-7. Reconstruction of speech in a music background. (a) Original. (b) Reconstruction.
2.4  Application  to  Speech  Modification 

Since the analysis/synthesis procedure has been expressed in terms of a functional model
describing the behavior of each sine-wave component, it is now possible to explore speech
modifications simply by transforming each of the functional descriptors. In performing these
modifications, the excitation and vocal tract amplitude and phase of each of the sine-wave
components will be manipulated in different ways. For example, in time-scale modification the
frequency trajectories of the excitation sine waves will be stretched or compressed in time, while
the vocal tract components will be made to move faster or slower. In pitch modification, the
spacing between the excitation frequency trajectories (which defines pitch) is made smaller or
larger, while preserving the vocal tract spectral characteristics. The first of these transformations
to be developed will be for time-scale modification.


3.  TIME-SCALE  MODIFICATION


The  goal  of  time-scale  modification  is  to  maintain  the  perceptual  quality  of  the  original 
speech  while  changing  the  apparent  rate  of  articulation.  This  requires  that  the  pitch  contour,  and 
thus  the  frequency  trajectories  of  the  excitation,  be  stretched  or  compressed  in  time,  and  that  the 
vocal  tract,  and  thus  the  amplitude  and  phase  of  the  vocal  tract  transfer  function,  be  changed  at 
a  slower  or  faster  rate  than  the  rate  of  normally  spoken  speech.  Thus  both  the  pitch  trajectory 
and  the  spectral  characteristics  of  the  speaker  are  essentially  preserved.  The  synthesis  method  of 
the  previous  section  is  ideally  suited  for  this  transformation  since  it  involves  summing  sine  waves 
composed  of  the  excitation  and  vocal  tract  system  contributions  for  which  explicit  functional 
expressions  have  been  derived. 

In  this  section,  a  method  is  first  presented  for  performing  a  fixed  rate  change  based  on  the 
analysis  and  synthesis  system  of  Section  2.  The  method  is  motivated  by  a  sine-wave  model  for 
time-scale  modification  of  speech.  With  the  fixed  rate-change  case  as  a  stepping  stone,  a  similar 
development  follows  for  time-varying  rate  change  where  the  time  scale  is  continuously  adjusted. 

As  well  as  providing  additional  flexibility,  an  adjustable  time  scale  can  lead  to  a  more  natural 
change  in  the  rate  of  articulation  (than  achieved  with  a  fixed  time-scale  modification)  by  allowing 
the  time  scale  to  adapt  to  various  features  of  the  speech  waveform. 

3.1  Fixed  Rate  Change 

For an arbitrary time-scale transformation, the time $t_0$ corresponding to the original
articulation rate is mapped to the transformed time $t'_0$ through the mapping

$$t'_0 = W(t_0) \qquad (3.1)$$

For a fixed rate change p, the mapping (3.1) is reduced to the linear relation $W(t_0) = p t_0$. The
case p > 1 corresponds to slowing down the rate of articulation by means of a time-scale
expansion, while the case p < 1 corresponds to speeding up the rate of articulation by means of a
time-scale compression. Speech “events” which take place at a time $t'_0$ according to the new time
scale will have occurred at $p^{-1} t'_0$ in the original time scale. The time-scale transformation $W(\cdot)$ is
illustrated in Figure 3-1 for the case p > 1 where the time scale is expanded. The time scales of
Figure  3-1  can  be  thought  of  as  representing  two  simultaneous  time  counters,  one  running  with 
respect  to  the  original  articulation  rate  and  the  other  with  respect  to  the  transformed  rate. 

In the sine-wave model for time-scale modification, the “events” which are time-scaled are
the system amplitudes and phases, $M(\omega,t)$ and $\Phi(\omega,t)$, and the excitation amplitudes and
frequencies, $a_\ell(t)$ and $\omega_\ell(t)$, of each underlying sine wave. The system parameters are manipulated
such that the vocal tract articulators move faster or slower in time. The excitation parameters are
modified so that frequency trajectories are stretched or compressed while maintaining pitch. The
model for time-scale modified speech is illustrated schematically in Figure 3-2 and represents a
simple modification of the input/output model for unmodified speech that was depicted in
Figure 2-1.

Figure 3-1. Time warping with fixed rate change p > 1.

Figure 3-2. Model of time-scale modification.

Based on the mathematical sine-wave model for speech production in (2.5), Figure 3-2
is easily transformed to a mathematical model for time-scale modified speech, denoted by s'(t'),
and is given by

$$s'(t') = \sum_{\ell=1}^{L(t')} A'_\ell(t') \cos[\theta'_\ell(t')] \qquad (3.2a)$$

where

$$A'_\ell(t') = A_\ell(p^{-1}t') = a_\ell(p^{-1}t')\, M_\ell(p^{-1}t') \qquad (3.2b)$$

and

$$\theta'_\ell(t') = \Omega'_\ell(t') + \Phi_\ell(p^{-1}t') \qquad (3.2c)$$

with

$$\Omega'_\ell(t') = \int_{t'_\ell}^{t'} \omega_\ell(p^{-1}\tau)\,d\tau + \phi_\ell \qquad (3.2d)$$

where the system functions $M_\ell(t)$ and $\Phi_\ell(t)$ were defined in Section 2.4 along the frequency
trajectories $\omega_\ell(t)$, and where the time scale in (3.2) corresponds to the transformed time scale of
Figure 3-1. The onset time of each excitation sine wave, $t_\ell$, in the original time scale is mapped
to the new onset time, $t'_\ell = p t_\ell$, where the excitation phase takes on the initial value $\phi_\ell$. Note
that with a change of variables $\sigma = p^{-1}\tau$, (3.2d) can be written as

$$\Omega'_\ell(t') = \int_{t_\ell}^{p^{-1}t'} \omega_\ell(\sigma)\,d\sigma \,/\, p^{-1} + \phi_\ell
= V_\ell(p^{-1}t')/p^{-1} + \phi_\ell \qquad (3.2e)$$

where $V_\ell(t)$ is the time-varying contribution to the excitation phase given in (2.2b) and (2.2c).
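
The reason for dividing the time-varying phase by $p^{-1}$ can be seen numerically: the derivative of the modified phase at transformed time t' equals the original frequency at the inverse time $p^{-1}t'$, so pitch is maintained while the trajectory is stretched. The sketch below uses a hypothetical linear frequency sweep with onset at time zero; the function names and parameter values are illustrative assumptions.

```python
def omega(t, w0=0.3, g=0.001):
    """Hypothetical excitation frequency trajectory (radians/sample)."""
    return w0 + g * t


def V(t, w0=0.3, g=0.001):
    """Integral of omega from the onset time 0 to t (time-varying phase)."""
    return w0 * t + 0.5 * g * t * t


def modified_phase(t_prime, p, phi=0.0):
    """Time-scaled excitation phase: V evaluated at the inverse time, divided by 1/p."""
    inv = 1.0 / p
    return V(inv * t_prime) / inv + phi


def numeric_frequency(t_prime, p, h=1e-4):
    """Central-difference derivative of the modified phase."""
    return (modified_phase(t_prime + h, p) - modified_phase(t_prime - h, p)) / (2.0 * h)


p = 2.0          # time-scale expansion by a factor of two
t_prime = 100.0  # a time instant in the transformed scale
f_modified = numeric_frequency(t_prime, p)
f_original = omega(t_prime / p)
```

Had the phase simply been evaluated at $p^{-1}t'$ without the division, the instantaneous frequency would have been scaled by $p^{-1}$ as well, which would change the pitch.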

The initial phase offset in (3.2d) is consistent with preserving the pulse-like nature of the
excitation function during voicing. To see how the excitation function is preserved, consider an
excitation function given by a periodic pulse train where the first pulse begins at time $t = t_0$. This
excitation can be represented by a sum of sinusoidal components $\cos[\ell\omega_0(t - t_0)]\,u(t - t_0)$ where
u(t) is the unit step function. Clearly, in this case, $t_\ell = t_0$ and $\phi_\ell = 0$ for all $\ell$ and thus the choice
of the phase offset $\phi_\ell$ results in the first pulse occurring at the time $t'_\ell = p t_\ell$ in the transformed
time scale. Figure 3-3 illustrates how a single sine-wave burst with initial phase offset $\phi_\ell = 0$
changes as it is time-scale modified. When all sine waves begin at the same time $t_\ell$ but before the
first pulse which occurs at time $t_0$, then the first pulse in the transformed time scale will occur at
a distance $t_0 - t_\ell$ from $t'_\ell$ when the phase offset is set as in (3.2d). In practice, however, a
periodic excitation pulse train is not impulse-like and the sine-wave onset times may be different.
When sine waves begin at different times $t_\ell$, then initializing the excitation phase with $\phi_\ell$ in the
transformed time scale may introduce phase dispersion in the periodic pulse train. This
dispersion, without knowledge of the first pulse position, may be difficult to avoid. (Note that in
the experimental system other phase offset models, e.g., $\phi_\ell/p^{-1}$, were found to result in a
pulse-like excitation during voicing.)

The  modifications  of  (3.2)  correspond  to  stretching  (or  compressing)  the  frequency  tracks  of 
the  excitation  (and  thus  the  pitch  contour),  and  the  slower  (or  faster)  movement  of  the  vocal 
tract  articulators.  As  in  the  model  for  speech  production  without  modification,  the  system 
magnitude  and  phase  need  be  specified  only  along  the  frequency  trajectories  of  the  excitation 
function. The functional mappings along these trajectories are illustrated in Figure 3-4 for
time-scale expansion over a time duration $0 \le t < t_f$. Both the excitation frequency and phase
trajectories  are  depicted  to  illustrate  the  preservation  of  pitch  and  the  special  modification  of  the 
excitation  phase  function. 

With this time-scale model as a basis, it is straightforward to construct a time-scale
modification system for fixed rate change using the analysis/synthesis structure of Section 2. The
estimates (2.11) through (2.18), obtained in the synthesis stage, provide functional forms for the
parameters in (3.2). Since these functional estimates were derived on a frame-by-frame basis, it is
natural to view the inverted time $p^{-1}t'$ as the time into the kth frame within the original time
scale. The estimates of the time-scaled parameters (3.2) can then be obtained by evaluating the
functional estimates at the time ($p^{-1}t'$ modulo T) where T is the original frame duration. In a
discrete-time implementation, the inverted time is given by ($p^{-1}n'$ modulo R) where R is the
number of samples in the original frame duration, where n' is the discrete (transformed) time
index and where the sampling interval is assumed unity. It follows that the time-scaled synthetic
waveform (in discrete time) can be obtained over the kth frame by replacing the model
parameters of (3.2) by their estimates:


$$\hat{s}'(n') = \sum_{\ell=1}^{L(n')} \hat{A}'_\ell(n') \cos[\hat{\Omega}'_\ell(n') + \hat{\Phi}'_\ell(n')] \qquad (3.3a)$$

where

$$\hat{A}'_\ell(n') = \hat{A}_\ell[(p^{-1}n')_R] \qquad (3.3b)$$

and

$$\hat{\Phi}'_\ell(n') = \hat{\Phi}_\ell[(p^{-1}n')_R] \qquad (3.3c)$$

and

$$\hat{\Omega}'_\ell(n') = \hat{V}_\ell[(p^{-1}n')_R]/p^{-1} + (\hat{\Sigma}_\ell^{k})' + \hat{\phi}_\ell \qquad (3.3d)$$

with $(\hat{\Sigma}_\ell^{k})'$ computed recursively as

$$(\hat{\Sigma}_\ell^{k+1})' = (\hat{\Sigma}_\ell^{k})' + \hat{V}_\ell(R)/p^{-1} \qquad (3.3e)$$

where $(\cdot)_R$ denotes modulo R. Recall from (2.17) that $\hat{V}_\ell(t)$ represents the time-varying phase
component which is added to the excitation phase obtained up to frame k. The accumulated
phase due to the time-varying phase is denoted $\hat{\Sigma}_\ell^{k}$ and was obtained by adding the values of
the time-varying phase at frame boundaries, i.e., $\hat{V}_\ell(R)$. In (3.3d) and (3.3e), $(\hat{\Sigma}_\ell^{k})'$ denotes this
same accumulated phase function but now divided by $p^{-1}$, consistent with (3.3e). The recursive
computation of $(\hat{\Sigma}_\ell^{k})'$ in (3.3e) is initialized at zero. The other parameter values in (3.3) are
obtained by sampling the estimates $\hat{M}_\ell(t)$, $\hat{a}_\ell(t)$, $\hat{V}_\ell(t)$, and $\hat{\Phi}_\ell(t)$ of (2.11) through (2.15) at time
$t_{n'} = (p^{-1}n')_R$ over each frame. Since the parameter estimates of the unmodified synthesis are
available as continuous functions of time, in theory, any rate change is possible. However,
arbitrarily large or small rate changes may not be meaningful from a speech production or
perceptual viewpoint. A block diagram of the synthesis component of the time-scale modification
procedure represented by (3.3) is shown in Figure 3-5 where the modulo R notation has been
eliminated. The analysis component is identical to that depicted in Figure 2-2.

Figure 3-4. Functional mappings for time-scale expansion.


Figure 3-5. Block diagram of uniform rate-change system.

In a computer implementation of (3.3), as with the analysis/synthesis system of Section 2,
the processing can be performed in block fashion. This is a natural computational strategy since
each of the underlying functional estimates of (3.3) was derived using only two consecutive
frames. The samples of the synthetic modified waveform are generated in segments by mapping
back the time index n' through one analysis frame to the time $t_{n'} = (p^{-1}n')_R$. In particular, suppose
that the kth frame has been entered. Then the amplitude and phase functional estimates are first
derived using (2.11) through (2.18). The discrete time index n' (initiated at zero) is updated until
the boundary of an analysis frame is crossed, i.e., $t_{n'} = (p^{-1}n')_R$ “wraps” back on itself. Until this
time, the set of functional estimates is sampled at the inverse time $t_{n'}$ and the modified waveform
is synthesized as in (3.3). When a new frame is entered, a new set of functional estimates (2.11)
through (2.16) is sampled according to the inverse time mapping $t_{n'} = (p^{-1}n')_R$, and the
accumulated time-varying excitation phase $(\hat{\Sigma}_\ell^{k})'$ is updated according to (3.3e), where at the
onset of a new sine wave $(\hat{\Sigma}_\ell^{k})' = 0$. Figure 3-6 gives a flow diagram of the
entire process. This block-oriented implementation has the advantage that storage of only two
consecutive sets of amplitude and phase parameters is required.
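
The block-oriented procedure can be sketched for a single sine-wave track. This is a simplified illustration, not the system described above: the frequency is assumed to vary linearly within each frame (rather than following the cubic phase of (2.13)), the system phase is omitted, and the function name `time_scale_track` and its frame tuple layout are hypothetical.

```python
import math


def time_scale_track(frames, R, p, phi=0.0):
    """Time-scale one track by the fixed rate change p.

    frames: list of (A0, A1, w0, w1) per analysis frame, i.e. composite
            amplitude and frequency (radians/sample) at the frame boundaries.
    R:      samples per original frame.
    Returns the modified waveform samples.
    """
    inv = 1.0 / p
    out = []
    acc = 0.0    # accumulated scaled phase, initialized at zero as in (3.3e)
    n_out = 0    # transformed time index n'
    for k, (A0, A1, w0, w1) in enumerate(frames):

        def V(t):
            # Time-varying phase over frame k for a linear frequency sweep.
            return w0 * t + 0.5 * (w1 - w0) * t * t / R

        t = inv * n_out - k * R          # inverse time into frame k
        while t < R:                     # stop when the frame boundary is crossed
            amp = A0 + (A1 - A0) * t / R
            out.append(amp * math.cos(V(t) / inv + acc + phi))
            n_out += 1
            t = inv * n_out - k * R
        acc += V(R) / inv                # phase update on entering the next frame
    return out


# Hypothetical input: two frames of a steady sinusoid at 0.3 rad/sample.
steady = [(1.0, 1.0, 0.3, 0.3)] * 2
expanded = time_scale_track(steady, R=4, p=2.0)    # twice as many samples
compressed = time_scale_track(steady, R=4, p=0.5)  # half as many samples
```

For a steady sinusoid the output is the same sinusoid at the same frequency, only longer or shorter, which is the expected behavior of a rate change that maintains pitch. Only the parameters of the current frame and the running accumulated phase need to be held in memory.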

To  test  the  reconstruction  procedure  of  (3.3),  time-scale  modifications  ranging  from  a 
compression  of  two  to  an  expansion  of  three  were  implemented  with  the  frame-based  method 
described  above  and  outlined  in  Figure  3-6.  The  example  in  Figure  3-7  illustrates  the  response  of 
the  system  to  a  synthetic  waveform  formed  by  convolving  a  periodic  pulse  train  (with  a  100  Hz 
fundamental)  with  an  exponentially  decaying  response.  Here  a  5  ms  frame  interval,  a  25  ms 
analysis window, and 40 peaks over a 4 kHz range were used. The rate-change factors are 1.0,
1.5, and 0.5. In processing speech, the parameters such as the number of peaks, sampling rates,
window length, etc., were set equal to those used in the system for unmodified speech. Generally,
the rate-changed synthetic speech was of high quality and free of artifacts such as glitches and
reverberation. Furthermore, the natural quality and smoothness of the original speech were
preserved through transitions such as voiced/unvoiced boundaries. Examples of the synthesis are
illustrated in Figures 3-8 and 3-9 for the case of a single speaker. Figure 3-8 depicts an example
of time-scale expansion by a factor of two during an unvoiced/voiced transition, while Figure 3-9
illustrates time-scale compression by a factor of two.

Although  the  original  system  was  designed  for  a  single  speaker,  as  with  the  baseline  system, 
the  time-scale  modification  system  was  also  found  to  perform  successfully  for  nonspeech  sounds 
and  speech  with  various  types  of  interference.  This  includes  music,  sounds  emitted  by  whales, 
multiple  speakers,  speech  in  acoustic  background  noise  (down  to  about  0  dB  S/N)  and  speech 
with  a  musical  background.  Examples  of  time-scale  modification  of  speech  in  music  and  of  a 
waveform  consisting  of  speech  from  overlapping  male  and  female  speakers  are  depicted  in 
Figures  3-10  and  3-11,  respectively.  The  modified  waveforms  were  natural  sounding  and  without 
artifacts.  The  characteristics  of  F15  cockpit  noise,  for  example,  are  hardly  altered  in  expansion 
and  compression  by  a  factor  of  two.  When  applying  the  system  to  music,  the  time-scaled  music 
seems  to  emanate  from  instruments  played  at  a  faster  or  slower  rate  than  normal.  When 
reconstructing  two  simultaneous  speakers,  the  modified  speech  seems  to  have  been  generated  by 
“linear  analysis/synthesis”  (i.e.,  the  sum  of  the  modified  waveforms  equals  the  modification  of  the 
sum).

Figure 3-6. Flow diagram of computer implementation of uniform rate-change system.

Figure 3-7. Time-scale modification of synthetic waveform. (a) Original, (b) Reconstruction (p = 1.0), (c) Expansion
(p = 1.5), and (d) Compression (p = 0.5).

Figure 3-8. Time-scale expansion of speech. (a) Original, and (b) Expansion (p = 2).

Figure 3-9. Time-scale compression of speech. (a) Original, and (b) Compression (p = 0.5).

Figure 3-10. Time-scale expansion of speech in music. (a) Original, and (b) Expansion (p = 1.5).

Figure 3-11. Time-scale expansion of combined male and female speech. (a) Original, and (b) Expansion (p = 2).

3.2  Time-Varying  Rate  Change 

With the proposed sinusoidal representation, it is also straightforward to model a
time-varying rate change p(t). Here the time-warping transformation is nonlinear and is given by

$$t' = W(t) = \int_{0}^{t} p(\tau)\,d\tau \qquad (3.4)$$

where $p(\tau)$ is the desired time-varying rate change. Note that for a constant p, (3.4) reduces to
the fixed rate-change case (3.1). In this generalization, each time-differential $d\tau$ is scaled by a
different factor $p(\tau)$. An example of a nonuniform time-warp is illustrated in Figure 3-12 for a
piecewise-constant rate change. Speech events which take place at a time t' in the new time scale
will have occurred at a time $t = W^{-1}(t')$ in the original time scale where $W^{-1}(\cdot)$ denotes the inverse
mapping from the new time scale back to the original time scale.

Figure 3-12. Piecewise-constant rate change. (a) Rate-change function, and (b) Time-warp.
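
For a rate change that is held constant over each interval of length T, both the warp and its inverse are piecewise linear and can be evaluated exactly. The following sketch assumes positive rates and one rate value per interval; the function names and the example rates are hypothetical.

```python
def warped_starts(rates, T):
    """Transformed time at the start of each interval, plus the final end point."""
    starts = [0.0]
    for p in rates:
        starts.append(starts[-1] + p * T)
    return starts


def warp(t, rates, T):
    """t' = W(t): integral of the piecewise-constant rate from 0 to t."""
    k = min(int(t // T), len(rates) - 1)
    return warped_starts(rates, T)[k] + rates[k] * (t - k * T)


def inverse_warp(t_prime, rates, T):
    """t = W^{-1}(t'): locate the interval containing t', then invert linearly."""
    starts = warped_starts(rates, T)
    k = 0
    while k < len(rates) - 1 and t_prime >= starts[k + 1]:
        k += 1
    return k * T + (t_prime - starts[k]) / rates[k]


# Hypothetical rate-change function: normal, expanded, then compressed.
rates = [1.0, 2.0, 0.5]
T = 10.0

t_new = warp(15.0, rates, T)            # 10 + 2 * 5
t_back = inverse_warp(t_new, rates, T)  # recovers 15
```

Approximating a smooth p(t) by shorter and shorter constant pieces keeps the same structure, only with more intervals.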

The speech model for time-varying rate change is given by (3.2) where the inverse time $p^{-1}t'$
is now replaced by $W^{-1}(t')$:

$$s'(t') = \sum_{\ell=1}^{L(t')} A'_\ell(t') \cos[\theta'_\ell(t')] \qquad (3.5a)$$

where

$$A'_\ell(t') = A_\ell[W^{-1}(t')] \qquad (3.5b)$$

and

$$\theta'_\ell(t') = \Omega'_\ell(t') + \Phi_\ell[W^{-1}(t')] \qquad (3.5c)$$

with

$$\Omega'_\ell(t') = \int_{t'_\ell}^{t'} \omega_\ell[W^{-1}(\tau)]\,d\tau + \phi_\ell \qquad (3.5d)$$

where $A_\ell(t)$ is the composite amplitude $a_\ell(t)M_\ell(t)$. As with the uniform rate-change case, the
phase offset in (3.5d) at time $t'_\ell$ is set so the transformation is consistent with maintaining the
shape of a perfectly periodic excitation pulse train in the vicinity of the transformed onset time
$t'_\ell$. Figure 3-13 illustrates nonuniform time-scaling applied to the system phase function with the
piecewise  constant  p(t)  of  Figure  3-12.  Similar  modifications  are  made  to  the  system  amplitudes 
and  the  excitation  amplitudes  and  frequencies.  Note  that  when  the  rate-change  function  p(t)  is  set 
to  a  constant,  all  of  the  expressions  in  (3.5)  reduce  to  those  of  the  fixed  rate-change  case  (3.2). 

An important difference, however, in the time-varying rate-change model from the fixed
rate-change case is that, for an arbitrary $W^{-1}(t)$, the modified excitation phase is not simply related to
the original excitation phase, as in (3.2e). Furthermore, the inverse time mapping $W^{-1}(t)$ generally
is  difficult  to  evaluate  exactly.  Thus  in  developing  an  implementation  for  time-varying  time-scale 
modification  based  on  the  mathematical  model  (3.5),  these  two  issues  must  first  be  addressed. 


Figure 3-13. System phase mapping for the piecewise-constant rate change of Figure 3-12.


One  class  of  time-scale  transformations  that  will  be  of  particular  interest  invokes  a  piecewise- 
constant  p(t).  This  condition  on  p(t)  allows  both  exact  time  inversion  and  exact  evaluation  of  the 
integral  equation,  (3.5d).  In  addition,  it  leads  to  a  digital  implementation  which  is  a 
straightforward  extension  of  the  constant  p(t)  case  of  the  previous  section.  An  example  of  a 
piecewise-constant  p(t)  and  a  resulting  time-scale  transformation  was  illustrated  in  Figures  3-12 
and  3-13.  A  more  general  piecewise-functional  representation  for  p(t),  involving  higher-order 
polynomials  (e.g.,  linear  and  quadratic),  will  be  discussed  later  in  this  section. 
To show that the piecewise-constant condition leads to exact inversion of (3.4), the time
increment over which the rate change is held constant is assumed, for convenience in
implementation, to equal the analysis frame duration T, i.e.,

$$p(t) = \text{constant} \quad \text{for } kT \le t \le (k+1)T . \tag{3.6}$$

Nevertheless,  there  is  no  restriction  on  how  small  this  increment  can  be  made  (regardless  of  the 
frame  duration)  since  continuous  functional  representations  of  all  model  parameters  are  available. 
(Consequently,  an  arbitrarily  smooth  p(t)  can  be  approximated  as  closely  as  desired.)  Given  the 
condition in (3.6), the time mapping (3.4) can then be written for $t'_k < t' \le t'_{k+1}$ as ($t_k$ being the
beginning of the kth frame):

$$\begin{aligned}
t' &= \int_0^{t} p(\tau)\,d\tau \\
   &= t'_k + \int_{t_k}^{t} p(\tau)\,d\tau \\
   &= t'_k + p(t_k)(t - t_k)
\end{aligned} \tag{3.7}$$

where $t_k$ and $t'_k$ are the times at the beginning of the kth original frame and modified frame,
respectively. The inverse time t, and thus the inverse mapping $W^{-1}(\cdot)$, is then given by

$$t = W^{-1}(t') = t_k + p^{-1}(t_k)(t' - t'_k) . \tag{3.8}$$

Since  the  parameters  of  the  sinusoidal  components  are  available  as  continuous  functions  of  time 
[i.e.,  equations  (2.11)  through  (2.18)]  over  each  analysis  frame  in  the  original  time  scale,  they  can 
always  be  found  at  the  required  inverse  time  of  (3.8).  Note  that  having  modified  the  speech 
waveform up to time $t'_k$, the time into the kth analysis frame can be written as

$$t = W^{-1}(t') = p^{-1}(t_k)\,t' \tag{3.9}$$

where t' is interpreted here as the time into the kth modified frame.
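
The forward mapping (3.7) and its inverse (3.8) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`frame_starts`, `warp`, `inverse_warp`) are hypothetical, the rate list holds one value of p per frame, and $p^{-1}(t_k)$ is taken to be the reciprocal $1/p(t_k)$.

```python
from bisect import bisect_right

def frame_starts(rates, T):
    """Return the frame start times t_k (original) and t'_k (modified)
    for frames of duration T with rate p_k held constant on frame k."""
    t = [k * T for k in range(len(rates) + 1)]
    tp = [0.0]
    for p in rates:
        tp.append(tp[-1] + p * T)      # (3.7) evaluated at the end of the frame
    return t, tp

def warp(t_val, rates, T):
    """Forward mapping t' = W(t), eq. (3.7)."""
    t, tp = frame_starts(rates, T)
    k = min(int(t_val // T), len(rates) - 1)
    return tp[k] + rates[k] * (t_val - t[k])

def inverse_warp(tp_val, rates, T):
    """Inverse mapping t = W^{-1}(t'), eq. (3.8)."""
    t, tp = frame_starts(rates, T)
    k = min(max(bisect_right(tp, tp_val) - 1, 0), len(rates) - 1)
    return t[k] + (tp_val - tp[k]) / rates[k]

# Illustrative values (assumed): three 10 ms frames with p = 1, 2 and 0.5.
rates = [1.0, 2.0, 0.5]
T = 10.0
t_new = warp(15.0, rates, T)             # 10 + 2 * (15 - 10) = 20
t_back = inverse_warp(t_new, rates, T)   # recovers 15
```

Because p(t) is constant on each frame, the inverse requires only a table lookup of the frame index followed by one division, which is what makes the inversion exact.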

With this piecewise-constant p(t), the excitation phase function can also be considerably
simplified. Suppose that the excitation phase estimate $\Omega'_\ell(t')$ has been evaluated up to time $t'_k$,
which when inverted corresponds to $t_k = kR$, the beginning of the kth frame in the original time
scale. Then with the substitution of variables $\sigma = \tau - t'_k$, (3.5d) can be written for $t'_k < t' \le t'_{k+1}$
in terms of a constant and a time-varying component:

$$\begin{aligned}
\Omega'_\ell(t') &= \Omega'_\ell(t'_k) + \int_{t'_k}^{t'} \omega_\ell[W^{-1}(\tau)]\,d\tau \\
                 &= \Omega'_\ell(t'_k) + \int_0^{t'-t'_k} \omega_\ell[W^{-1}(\sigma + t'_k)]\,d\sigma
\end{aligned} \tag{3.10}$$

where $\Omega'_\ell(t'_k)$ represents the excitation phase estimate computed up to time $t'_k$ in the new time
scale. Using the recursion in (3.8), (3.10) can be written as

$$\Omega'_\ell(t') = \Omega'_\ell(t'_k) + \int_0^{t'-t'_k} \omega_\ell[t_k + p^{-1}(t_k)\sigma]\,d\sigma . \tag{3.11}$$

With  some  further  manipulation,  the  expression  in  (3.11)  can  now  be  written  in  terms  of 
modified  versions  of  the  constant  and  time-varying  excitation  phase  components  of  (2.17). 

Making the substitution of variables $\tau = p^{-1}(t_k)\sigma$, (3.11) is given by

$$\Omega'_\ell(t') = \Omega'_\ell(t'_k) + \int_0^{(t'-t'_k)p^{-1}(t_k)} \omega_\ell[t_k + \tau]\,d\tau \,/\, p^{-1}(t_k) . \tag{3.12}$$

The integral expression in (3.12) is just a time-warped and scaled version of the time-varying
excitation phase component $V_\ell(t)$ of (2.17b) over the kth modified frame. Thus for the kth frame
(3.12) can be written as

$$\Omega'_\ell(t') = \Omega'_\ell(t'_k) + V_\ell[p^{-1}(t_k)(t' - t'_k)] \,/\, p^{-1}(t_k) . \tag{3.13}$$

Likewise, the constant term in (3.13) is similar to the constant term in (2.18) but with scaled
phase components. Specifically, (3.13) can be expressed as

$$\Omega'_\ell(t') = V_\ell[p^{-1}(t_k)(t' - t'_k)] \,/\, p^{-1}(t_k) + (\Sigma_\ell^k)' + \phi_\ell \tag{3.14a}$$

where $(\Sigma_\ell^k)'$ can be written recursively as

$$(\Sigma_\ell^{k+1})' = (\Sigma_\ell^k)' + V_\ell(R) \,/\, p^{-1}(kR) \tag{3.14b}$$

with $(\Sigma_\ell^k)'$ representing the accumulated scaled time-varying excitation phase component. Note
that in (3.14) $t' - t'_k$ is the time into the kth modified frame and $p^{-1}(t_k)(t' - t'_k)$ is the time into the
kth original frame.
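
The step from (3.11) to (3.13) can be checked numerically on a single frame. The Python sketch below assumes, purely for illustration, a linear frequency track $\omega(u) = w_0 + a u$ within the frame, so that $V(u) = w_0 u + a u^2/2$; this is a stand-in for the functional form of (2.17b), not the report's estimator, and all names and numerical values are hypothetical.

```python
import math

def V(u, w0, a):
    """Time-varying excitation phase u seconds into the original frame for
    the assumed linear frequency track w(u) = w0 + a*u."""
    return w0 * u + 0.5 * a * u * u

def modified_phase_increment(s, p, w0, a):
    """Phase accumulated s seconds into the modified frame: the second term
    of (3.13), V[p^{-1} s] / p^{-1}."""
    inv_p = 1.0 / p                      # p^{-1}(t_k)
    return V(inv_p * s, w0, a) / inv_p

def riemann_phase_increment(s, p, w0, a, steps=10000):
    """Midpoint-rule evaluation of the integral in (3.11), for comparison."""
    d = s / steps
    return sum((w0 + a * ((i + 0.5) * d) / p) * d for i in range(steps))

# Illustrative values (assumed): 100 Hz track rising at 1000 Hz/s, p = 2.
w0 = 2.0 * math.pi * 100.0
a = 2.0 * math.pi * 1000.0
p = 2.0
s = 0.01
closed_form = modified_phase_increment(s, p, w0, a)
numeric = riemann_phase_increment(s, p, w0, a)
```

The two values agree to within rounding error. Differentiating the closed form with respect to s gives $\omega(s/p)$, i.e., the original frequency read at the inverse time, which is the sense in which the pitch is preserved.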

With  the  inverse  time  recursion  (3.8)  and  the  excitation  phase  recursion  in  (3.14),  a  digital 
implementation  of  a  time-varying  rate  change  system  can  be  realized.  In  this  implementation  the 
inverted  time  is  given  by 

$$t_{n'} = kR + p^{-1}(kR)(n' - n'_k) \tag{3.15}$$

where $t_{n'}$ denotes the inverse to the discrete time index n' and where the sampling interval is
assumed normalized to unity. As in the uniform time-scale case, the estimates (2.11) through
(2.18) provide functional forms for the parameters in (3.5). Since these functional estimates were
derived on a frame-by-frame basis, it is natural to view the inverted time $t_{n'}$ in (3.15) as the time
into the kth frame within the original time scale. Thus, as before, this time is computed modulo R,
which is denoted by $(t_{n'})_R$, R being the number of samples over the original frame duration. The
discretized version of (3.5) can then be written as

$$s'(n') = \sum_{\ell=1}^{L(n')} A'_\ell(n') \cos[\Omega'_\ell(n') + \Phi'_\ell(n')] \tag{3.16a}$$

where

$$A'_\ell(n') = A_\ell[(t_{n'})_R] \tag{3.16b}$$

and

$$\Phi'_\ell(n') = \Phi_\ell[(t_{n'})_R] \tag{3.16c}$$

and

$$\Omega'_\ell(n') = V_\ell[(t_{n'})_R] \,/\, p^{-1}(kR) + (\Sigma_\ell^k)' + \phi_\ell \tag{3.16d}$$

with $(\Sigma_\ell^k)'$ updated as

$$(\Sigma_\ell^{k+1})' = (\Sigma_\ell^k)' + V_\ell(R) \,/\, p^{-1}(kR) \tag{3.16e}$$

and the inverse time recursion is given by

$$t_{n'} = kR + p^{-1}(kR)(n' - n'_k) \tag{3.16f}$$

The phase recursion in (3.16e) is initialized with $(\Sigma_\ell^k)' = 0$ at the onset of each sine wave. Note
also that this recursion requires scaling the time-varying excitation phase by $p^{-1}(kR)$, which
changes on each frame. This has the effect of keeping the pitch as close to the original as
possible.

As  with  uniform  time-scale  modification,  the  discrete-time  implementation  given  in  (3.16)  can 
be  performed  in  block  fashion  where  storage  of  only  two  consecutive  sets  of  amplitude  and  phase 
parameters  is  required.  In  fact,  for  a  particular  frame,  the  operations  represented  by  (3.3)  and 
(3.16)  are  essentially  identical.  A  flow  diagram  of  the  process  is  illustrated  in  Figure  3-14  and  is 
similar  to  that  of  the  uniform  case  given  in  Figure  3-6. 
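
A block-by-block realization of (3.16) for a single sine-wave track might look like the following Python sketch. It is deliberately simplified and rests on assumptions not made in the report: the frequency and amplitude are held constant within each frame, so that $V_\ell(u) = w u$; the system phase of (3.16c) is omitted; and p(kR)R is assumed to be an integer number of samples. The function name `time_scale_track` is hypothetical.

```python
import math

def time_scale_track(freqs, amps, rates, R, phi=0.0):
    """Synthesize one time-scaled sine-wave track, frame by frame.

    freqs[k], amps[k]: frequency (rad/sample) and amplitude on frame k
                       (held constant within the frame, a simplification)
    rates[k]:          rate-change factor p(kR) on frame k
    R:                 samples per original frame
    """
    out, phases = [], []
    sigma = 0.0                           # (Sigma_l^k)', zero at track onset
    for w, A, p in zip(freqs, amps, rates):
        inv_p = 1.0 / p                   # p^{-1}(kR)
        n_out = int(round(p * R))         # samples in the kth modified frame
        for m in range(n_out):            # m = n' - n'_k
            u = inv_p * m                 # (t_n')_R, time into original frame (3.16f)
            phase = (w * u) / inv_p + sigma + phi    # (3.16d) with V(u) = w*u
            phases.append(phase)
            out.append(A * math.cos(phase))          # (3.16a), system phase omitted
        sigma += (w * R) / inv_p          # (3.16e)
    return out, phases

# Illustrative use: the first frame is slowed by 2, the second sped up by 2.
samples, phases = time_scale_track([0.1, 0.1], [1.0, 1.0], [2.0, 0.5], R=100)
```

In this example the output contains 250 samples instead of 200, while the phase advance per sample stays at 0.1 rad everywhere, including across the frame boundary. Only the current frame's parameters and the running sum are needed at any time.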

The  generalized  time-scale  modification  system  (3.16)  was  demonstrated  using  two  long 
speech  passages  (25-30  s),  for  a  male  and  for  a  female  speaker,  with  various  time-varying  rate 
changes. For this experiment p(t) was held constant over the duration of each 5 ms analysis
frame.  Decreasing  the  frame  duration  (and  hence  the  time  over  which  the  rate  change  is  held 
constant)  or  increasing  the  frame  duration,  but  bounding  it  by  10  ms,  did  not  noticeably  change 
the  quality.  Both  linear  and  oscillatory  rate  changes  were  performed.  In  the  linear  case  the 
scaling  factor  changed  from  unity  to  0.5  and  to  2.  In  the  oscillatory  case,  the  scaling  factor  was 
modulated  between  0.5  and  2,  one  oscillation  being  about  5  s  in  duration.  The  synthetic  speech 
was  generally  natural-sounding  and  free  of  artifacts. 
Figure 3-14. Flow diagram of computer implementation of nonuniform rate-change system.

This section has developed just one functional form for p(t), i.e., the piecewise-constant case.
Clearly, this procedure can be generalized with higher-order representations for p(t). For
example, a piecewise-linear p(t) can be used in (3.7), which leads to a time-inversion formula
somewhat more complicated than (3.8). Furthermore, since this inversion formula requires a
computation of the form $(t')^{1/2}$, and since the frequency track estimate of each sine wave is a
quadratic function of time, the integral expression in (3.10) is easily evaluated.
Consequently, given any desired smoothly changing rate-change function p(t), a closer fit can be
obtained than offered by the piecewise-constant case. More generally, it is possible to obtain even
higher-order approximations to an arbitrary p(t) over each frame. However, since the piecewise-constant
case appears to yield satisfactory quality, these higher-order approximations may result
in negligible improvements.
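
As an illustration of the piecewise-linear case, the inversion can be worked out under the assumption that, on the kth frame, $p(t) = p_k + c_k (t - t_k)$. The symbols $p_k = p(t_k)$ and the slope $c_k$ are introduced here only for this sketch and do not appear in the report.

$$\begin{aligned}
t' - t'_k &= \int_{t_k}^{t} \left[ p_k + c_k(\tau - t_k) \right] d\tau
           = p_k (t - t_k) + \tfrac{1}{2} c_k (t - t_k)^2 , \\
t - t_k   &= \frac{-p_k + \sqrt{p_k^2 + 2 c_k (t' - t'_k)}}{c_k} , \qquad c_k \neq 0 .
\end{aligned}$$

The positive root is taken so that $t \to t_k$ as $t' \to t'_k$ (with $p_k > 0$). As $c_k \to 0$ the expression tends to $(t' - t'_k)/p_k$, which is (3.8). The square root is presumably the origin of the $(t')^{1/2}$ dependence mentioned above.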


3.3  Feature-Based  Time-Scale  Modification 


The  accuracy  of  the  representation  of  time-scale  modification  of  Section  3.1  is  subject  to  a 
number  of  conditions.  First,  as  with  the  sine-wave  model  for  unmodified  speech,  the  vocal  tract 
and  vocal  cord  parameters  must  be  slowly-varying  relative  to  the  duration  of  the  vocal  tract 
impulse  response.  This  condition  is  generally  satisfied  as  is  evident  from  the  accurate 
reconstruction  of  unmodified  speech.  A  second  condition  is  that  the  rate  change  of  the  actual 
vocal  tract  and  vocal  cord  articulators  be  fixed  as  a  function  of  time.  Generally,  this  condition 
will  not  be  satisfied.  For  example,  when  the  rate  of  articulation  is  reduced  in  natural  speech, 
consonants  and  fricatives  are  generally  slowed  down  less  than  vowels  and  other  more  steady-state 
sounds.  Although,  as  seen  in  Section  3.1,  uniform  rate  change  results  in  generally  high-quality 
synthesis,  the  excessive  slowing  down  of  rapid  speech  sounds  can  render  a  “drunken  man”  quality 
to  the  synthetic  speech.  In  an  attempt  to  reduce  this  effect,  an  adaptive  rate-change  system  was 
implemented.  This  system  was  designed  to  generate  more  natural  sounding  speech  by 
continuously  adapting  the  rate  change  to  the  temporal  characteristics  of  speech. 


The  degree  of  time-compression  or  expansion  in  an  adaptive  system  should  become  a 
function  of  the  rate  at  which  speech  events  change.  Transient  events  should  be  slowed  down  less. 
This  requires  that  some  sort  of  detector  be  developed  which  can  measure  significant  speech 
activity. One such detector is the spectral derivative [22], which in this report is defined using
normalized squared differences computed from the available matched peak magnitudes.
Specifically, the spectral derivative over two consecutive frame boundaries is given by

$$D(k) = \frac{\left[ \sum_{\ell} (A_\ell^k - A_\ell^{k-1})^2 \right]^{1/2}}{\left[ \sum_{\ell} (A_\ell^k)^2 \right]^{1/2}} \tag{3.17}$$

where k refers to the frame number and $A_\ell^k$ is defined in (2.8a). The spectral derivative as defined
in (3.17) tends to increase at voiced/unvoiced boundaries, during unvoiced speech and during
consonant transitions. This norm has the additional advantage that it is bounded; in particular, it
is straightforward to show with a simple geometric argument that

$$0 \le D(k) \le 2 . \tag{3.18}$$
Thus when p(t) is made a function of D(k), the range of p(t) can be determined from (3.18). In
slowing down the rate of articulation, when the spectral derivative is high due to rapid speech
activity, the time scale should be expanded less than when the spectral derivative is low. Thus
p(t) is a monotonically decreasing function of D(k). In compressing speech, the time scale should
be compressed less when the spectral derivative is large, and so in this case p(t) is a
monotonically increasing function of D(k). The functional relation should be chosen to lead to
some desired average modified time scale. For example, one functional relation between D(k) and
p(t) is given by

$$p(kR) = 3 - D(k) \quad \text{for } 0 \le D(k) \le 2 \tag{3.19}$$

where p(t) is specified only at frame boundaries and is linearly interpolated across each frame.
Under the assumption that D(k) is equally probable over its range of values, it can be shown
that (3.19) results in an average rate-change factor of two.
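
The average of two follows because the mean of D(k) under a uniform distribution on [0, 2] is 1, so the mean of 3 - D(k) is 2. A short Python sketch of the mapping (3.19), with the linear interpolation across a frame, makes this concrete; the function names and the grid used to mimic equally probable values of D(k) are illustrative assumptions.

```python
def rate_from_spectral_derivative(D):
    """Eq. (3.19): p(kR) = 3 - D(k), defined for 0 <= D(k) <= 2."""
    if not 0.0 <= D <= 2.0:
        raise ValueError("D(k) must lie in [0, 2], see (3.18)")
    return 3.0 - D

def interpolate_rate(p_start, p_end, frac):
    """Linear interpolation of p(t) across one frame, frac in [0, 1]."""
    return p_start + frac * (p_end - p_start)

# Average of (3.19) over a uniform grid of D values on [0, 2].
N = 2001
grid = [2.0 * i / (N - 1) for i in range(N)]
mean_rate = sum(rate_from_spectral_derivative(D) for D in grid) / N
```

Here `mean_rate` evaluates to 2 up to rounding, and the rate itself ranges from 3 (steady-state sounds, D = 0) down to 1 (maximum spectral change, D = 2).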

An example of the spectral derivative defined in (3.17), computed from sine-wave amplitudes, is given
in Figure 3-15, where it is seen that the spectral derivative increases for regions of frication and
other regions of “high activity”. The top segment represents the fricative sound “sh” in the word
“she”, and it can be seen that the spectral derivative is high. In the lower segment, the spectral
derivative is high during the plosive at the onset of the “f” in the word “fell”, while the spectral
derivative decreases in traversing to the voiced sound “e”.

Figure 3-15. Speech segments from the passage, “she fell from the car,” with superimposed spectral derivative. The
spectral derivative has been normalized to lie between zero and unity and was assumed constant over an analysis
frame.

In  some  preliminary  experiments,  speech  passages  of  two  seconds  in  duration,  slowed  down 
by  a  fixed  factor  of  two,  were  compared  with  the  same  speech  passages  slowed  down  by  the 
adaptive  system.  In  the  adaptive  system,  the  overall  average  rate  was  kept  approximately  equal  to 
the  fixed  rate  by  applying  the  functional  relation  of  (3.19).  Informal  listening  indicated  that  the 
fricative  and  consonant  regions  sounded  more  natural  when  modified  with  the  adaptive  system. 
Moreover,  the  “drunken  man”  effect  that  occurred  for  the  fixed  rate-change  system  seems  to  be 
reduced.  Additional  more  extensive  listening  tests,  however,  are  required  before  definitive 
conclusions  can  be  drawn. 

The  broad  goal  in  this  work  was  to  generate  modified  speech  that  sounded  more  like  that 
spoken  at  a  slower  or  faster  rate.  Although  the  rapid  transitions  in  speech  can  be  identified  with 
rapid transitions in the magnitude of its short-time spectrum [22,23], it is not clear that such a
speech  “feature”  is  the  only  indicator  or  the  most  accurate  indicator  of  when  to  change  the  rate. 
More  accurate  indicators  may  require  linguistic  knowledge  of  changing  events  that  occur  in 
slowly  and  rapidly  spoken  speech.  Furthermore,  additional  work  would  need  to  be  done  relating 
the  speech  activity  and  the  rate  of  time-scale  modification. 
4. FREQUENCY TRANSFORMATIONS


Since  the  synthesis  procedure  consists  of  adding  up  the  sinusoidal  waveforms  for  each  of  the 
measured  frequencies,  the  procedure  is  ideally  suited  for  performing  frequency  transformations.  In 
this  section,  two  such  transformations  are  presented.  The  first  transformation,  often  referred  to  as 
frequency scaling [17], compresses or expands the short-time Fourier transform in frequency such
that  both  the  spectral  envelope  and  the  spacing  between  harmonic  components  in  the  spectrum 
(and  thus  the  pitch  contour)  are  scaled.  The  second  transformation  scales  the  pitch  contour  but 
preserves  the  short-time  spectral  envelope.  As  with  time-scale  modification,  it  is  shown  that  pitch 
modification  can  be  performed  with  a  time-varying  scale  factor. 

4.1  Frequency  Scaling 

Frequency  compression  or  expansion  of  the  short-time  Fourier  transform  can  be  represented 
with a slight modification of the excitation phase given in (2.2). In this procedure, each frequency
track $\omega_\ell(t)$ is scaled by a desired factor $\beta$. This results in the modified excitation phase:

$$\begin{aligned}
\Omega'_\ell(t) &= \int_{t_\ell}^{t} \beta\,\omega_\ell(\tau)\,d\tau + \phi_\ell \\
                &= \beta V_\ell(t) + \phi_\ell
\end{aligned} \tag{4.1}$$

The original composite amplitude $A_\ell(t)$ and system phase $\Phi_\ell(t)$ estimates are simply shifted to the
new location of the frequency track $\beta\omega_\ell(t)$. These operations are equivalent to shifting the
excitation function to the new frequency track locations and scaling the frequency argument of
the vocal tract system function to form $H(\beta\omega, t)$. An illustration of frequency compression is
given in the time-frequency domain in Figure 4-1.

Figure 4-1. Time-frequency illustration of frequency compression. (a) Original. (b) Frequency compressed.

Using (4.1), a discrete-time implementation of a frequency-scaling system can then be realized
as a simple extension of (2.20). The resulting modified waveform over the kth frame is given (in
discrete time) by

$$s'(n) = \sum_{\ell=1}^{L(n)} A_\ell(n) \cos[\Omega'_\ell(n) + \Phi_\ell(n)] \tag{4.2a}$$

where

$$\Omega'_\ell(n) = \beta V_\ell(n) + (\Sigma_\ell^k)' + \phi_\ell \tag{4.2b}$$

with $(\Sigma_\ell^k)'$ computed recursively as

$$(\Sigma_\ell^{k+1})' = (\Sigma_\ell^k)' + \beta V_\ell(R) \tag{4.2c}$$

where the index n = 0, 1, 2, ..., R - 1 is interpreted as the discrete time into the kth frame and
where the amplitude and phase functions are sampled versions of their continuous counterparts in
(2.11) to (2.18). This waveform modification corresponds to an expansion or compression of the
spectral envelope and a change in pitch. In some applications, such as frequency scaling for the
hearing impaired [4], it may be required to scale the spectrum nonlinearly in frequency. This
transformation can also be easily realized with (4.2) by simply making the scale factor $\beta$ a
function of frequency (i.e., a function of the frequency track index $\ell$). Note that, as in the
previous sections, the computer implementation of (4.2) can be performed on a frame-by-frame
basis, thus requiring storage of only two sets of amplitude and phase parameters.
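
The recursion (4.2b) and (4.2c) can be sketched for one track as follows. The sketch assumes a frequency held constant within each frame, so that $V_\ell(n) = w n$, and it leaves out the system phase $\Phi_\ell(n)$; both are simplifications made here for brevity, and the function name `frequency_scale_track` is hypothetical.

```python
import math

def frequency_scale_track(freqs, amps, beta, R, phi=0.0):
    """Frequency-scale one sine-wave track by the factor beta.

    freqs[k], amps[k]: frequency (rad/sample) and composite amplitude on
                       frame k, held constant within the frame (assumed)
    R:                 samples per frame
    """
    out = []
    sigma = 0.0                                    # (Sigma_l^k)'
    for w, A in zip(freqs, amps):
        for n in range(R):                         # time into the kth frame
            phase = beta * (w * n) + sigma + phi   # (4.2b) with V(n) = w*n
            out.append(A * math.cos(phase))        # (4.2a), system phase omitted
        sigma += beta * (w * R)                    # (4.2c)
    return out

# Illustrative use: a 0.2 rad/sample track compressed by beta = 0.8.
scaled = frequency_scale_track([0.2, 0.2], [1.0, 1.0], beta=0.8, R=50)
```

Unlike time-scale modification, the number of output samples equals the number of input samples; only the rate of phase advance changes, here from 0.2 to 0.16 rad per sample. Making `beta` depend on the track index would give the nonlinear frequency scaling mentioned above.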

In  one  experiment,  the  spectrum  of  a  speech  signal  was  mapped  from  the  frequency  range 
0-5  kHz  down  to  0-4  kHz  and  thus  the  pitch  was  also  lowered  by  20%.  An  example  of  this 
spectral  transformation  is  given  in  Figure  4-2  where  it  is  seen  that  the  spectral  envelope  has  been 
compressed  and  the  pitch  lowered  (i.e.,  the  harmonic  spacing  is  decreased).  In  a  similar 
experiment,  the  speech  spectrum  was  scaled  from  the  range  0-4  kHz  up  to  0-5  kHz. 
Figure 4-2. Frequency compression of the spectral magnitude. (a) Original. (b) Compression.

4.2  Pitch  Modification 

In a simplified model for pitch modification, the excitation function of the speaker is
modified as in the previous section, while the spectral envelope of the speaker’s vocal tract system
function is unchanged. This corresponds to voice quality which is similar to the original, but
which is characterized by a change in pitch. The model for pitch modification is depicted in
Figure 4-3. This modification is easily simulated with a system based on the sinusoidal
analysis/synthesis structure in Figures 2-2 and 2-4.

The first step in frequency scaling of the excitation function requires that the estimated
frequency track of each sine-wave component be scaled by a desired factor $\beta$ to generate a new
frequency track $\beta\omega_\ell(t)$. This results (in discrete time) in an excitation phase given in (4.2b) and
(4.2c). The next step in frequency scaling the excitation function requires that the excitation
amplitude estimate $a_\ell(t)$ be shifted to the new frequency track locations. It is assumed that only
the frequencies within the resulting bandwidth of the modified excitation function are used in
synthesis (i.e., to maintain the original speech bandwidth with frequency compression would
require high-frequency regeneration).

To preserve the shape of the short-time spectral envelope, the system amplitudes and phases
must be computed at the new frequency track locations, $\beta\omega_\ell(t)$. These system amplitude and
phase functions are obtained by first sampling (in frequency) the smooth system amplitude and
phase estimates, derived from the homomorphic analyzer, at the modified frequencies $\beta\omega_\ell$. These
values are then linearly interpolated across successive frame boundaries to generate the amplitude
estimate, $M[\beta\omega_\ell(t), t]$, and the phase estimate, $\Phi[\beta\omega_\ell(t), t]$, along the new frequency track
locations. A time-frequency illustration of the excitation and system modifications required in the
model for pitch modification is given in Figure 4-4.

With the above modified excitation and system components, the resulting modified discrete-time
waveform over the kth frame is given by

$$s'(n) = \sum_{\ell=1}^{L(n)} a_\ell(n)\, M'_\ell(n) \cos[\Omega'_\ell(n) + \Phi'_\ell(n)] \tag{4.3a}$$

where

$$M'_\ell(n) = M[\beta\omega_\ell(n), n] \tag{4.3b}$$

and

$$\Phi'_\ell(n) = \Phi[\beta\omega_\ell(n), n] \tag{4.3c}$$

and

$$\Omega'_\ell(n) = \beta V_\ell(n) + (\Sigma_\ell^k)' + \phi_\ell \tag{4.3d}$$

with $(\Sigma_\ell^k)'$ computed recursively as

$$(\Sigma_\ell^{k+1})' = (\Sigma_\ell^k)' + \beta V_\ell(R) \tag{4.3e}$$

where (4.3b) and (4.3c) are the discrete-time forms of the above continuous-time magnitude
estimate $M[\beta\omega_\ell(t), t]$ and phase estimate $\Phi[\beta\omega_\ell(t), t]$. The discrete-time function $a_\ell(n)$ is a sampled
version of the continuous function (2.11b). As before, the discrete-time index n = 0, 1, 2, ..., R - 1
is viewed as the time into each frame.
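
The essential difference from frequency scaling is that the system magnitude is re-read at the scaled frequencies rather than carried along with each track. The Python sketch below illustrates this for one frame. It assumes the smooth envelope is available as a short table of (frequency, magnitude) points with linear interpolation between them, which stands in for the homomorphic estimate; the names `envelope_at` and `pitch_scaled_partials`, the `bandwidth` parameter, and all numerical values are hypothetical.

```python
def envelope_at(freq, grid_freqs, grid_mags):
    """Linearly interpolate a smooth system magnitude at `freq` (Hz)."""
    if freq <= grid_freqs[0]:
        return grid_mags[0]
    for i in range(1, len(grid_freqs)):
        if freq <= grid_freqs[i]:
            f0, f1 = grid_freqs[i - 1], grid_freqs[i]
            m0, m1 = grid_mags[i - 1], grid_mags[i]
            return m0 + (m1 - m0) * (freq - f0) / (f1 - f0)
    return grid_mags[-1]

def pitch_scaled_partials(track_freqs, excitation_amps, beta,
                          grid_freqs, grid_mags, bandwidth):
    """Return (frequency, amplitude) pairs for the pitch-modified frame.

    Each track moves to beta * w, keeps its excitation amplitude a_l, and
    takes its system magnitude from the envelope at the new frequency,
    in the spirit of (4.3a) and (4.3b). Tracks beyond `bandwidth` are dropped.
    """
    partials = []
    for w, a in zip(track_freqs, excitation_amps):
        w_new = beta * w
        if w_new > bandwidth:
            continue
        partials.append((w_new, a * envelope_at(w_new, grid_freqs, grid_mags)))
    return partials

# Illustrative frame: 20 harmonics of 200 Hz with a flat excitation.
grid_freqs = [0.0, 1000.0, 2000.0, 4000.0]
grid_mags = [1.0, 2.0, 1.0, 0.5]
tracks = [200.0 * l for l in range(1, 21)]
amps = [1.0] * 20
partials = pitch_scaled_partials(tracks, amps, 1.5, grid_freqs, grid_mags, 4000.0)
```

With $\beta$ = 1.5 the harmonics move to multiples of 300 Hz and thirteen of them remain inside the 4 kHz band, while the envelope peak stays at 1 kHz. The system phase would be sampled at the new frequencies in the same way.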
Figure 4-3. Sinusoidal model for pitch modification.

The system was first evaluated using the synthetic waveform of Figure 4-5. In this example
the pitch was altered with factors of $\beta$ = 1.5 and $\beta$ = 0.5, with reconstructions using 40 peaks over
4 kHz. With real speech, the system was demonstrated by scaling the pitch over a range from
$\beta$ = 0.5 to $\beta$ = 2 for a variety of male and female speakers. An example of a pitch increase of
50% (i.e., $\beta$ = 1.5) is illustrated in Figure 4-6, where it can be seen that the harmonic line spacing
has increased while the spectral shape is maintained. Figure 4-7 shows in the time domain the
result of pitch lowering by 20%, where the pitch period has increased. The synthetic speech
resulting from this system is smooth and without artifacts such as glitches or reverberation.
Except for a lack of formant shaping, the reconstruction takes on the characteristics of higher- or
lower-pitched speakers. With a scaling factor less than unity, comparison with the original speech
was performed over a frequency range lower than the original frequency band to avoid the
requirement of high-frequency regeneration.

It is possible to generalize Equation (4.1), used as the basis for pitch scaling, by making the
scaling factor $\beta$ a function of time. The modified frequency tracks are given by $\beta(t)\omega_\ell(t)$ and the
resulting excitation phase can then be written as

$$\Omega'_\ell(t) = \int_{t_\ell}^{t} \beta(\tau)\,\omega_\ell(\tau)\,d\tau + \phi_\ell . \tag{4.4}$$
Figure 4-5. Pitch modification of synthetic waveform. (a) Original. (b) Increase in pitch ($\beta$ = 1.5).
(c) Decrease in pitch ($\beta$ = 0.5).

Figure 4-6. Pitch modification of speech in the frequency domain. (a) Original. (b) Pitch-scaled spectral
magnitude ($\beta$ = 1.5).

Figure 4-7. Pitch modification of speech in the time domain. (a) Original. (b) Pitch-scaled ($\beta$ = 0.8).


One approach to digitally implementing (4.4) is to constrain $\beta(t)$ to be piecewise-constant (as was
done for time-scale modification) and solve the integral exactly over each frame. This results in
an implementation which is identical to the uniform pitch-scale case over each frame. However,
the sudden jumps in the resulting frequency contours are not desirable for high-quality speech
synthesis. Consequently, other approaches must be taken. In one method for a discrete-time
implementation of (4.4), the pitch-scale factor $\beta(t)$ is assumed to vary slowly with respect to the
sampling time interval. Under this condition, the integral in (4.4) can be approximated by the
recursion

$$\Omega'_\ell(n) = \Omega'_\ell(n-1) + \beta(n)\,\omega_\ell(n) \tag{4.5}$$

where the discrete-time index n is thought of as the time into the kth frame and where the
recursion is initialized with the phase resulting from the previous, (k - 1)st, frame. Another
approach, similar to that discussed at the end of Section 3.2, allows $\beta(t)$ to take on a higher-order
piecewise-functional form (e.g., linear or quadratic), thus assuring the continuity of
frequency. Since $\omega_\ell(\tau)$ in (4.4) is a quadratic function of time, the integral in (4.4) can be
evaluated exactly over each frame. Such an approach avoids having to approximate the desired
frequency trajectory and may also avoid phase drift which can occur through the recursion (4.5).
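
The recursion (4.5) and the phase drift it can introduce are easy to see in a small Python sketch. The example assumes a constant track frequency and a slowly oscillating $\beta(n)$, for which the integral in (4.4) has a closed form; the function name `phase_by_recursion` and the numerical values are illustrative only.

```python
import math

def phase_by_recursion(beta, omega, n_samples, phase0=0.0):
    """Eq. (4.5): Omega'(n) = Omega'(n-1) + beta(n) * omega(n).

    beta, omega: callables of the sample index n (sampling interval = 1)
    phase0:      phase carried over from the previous frame
    """
    phases = []
    prev = phase0
    for n in range(1, n_samples + 1):
        prev = prev + beta(n) * omega(n)
        phases.append(prev)
    return phases

# Assumed test case: w = 0.3 rad/sample, beta oscillating by +/-20%
# with a period of P = 2000 samples.
w = 0.3
P = 2000.0
beta = lambda n: 1.0 + 0.2 * math.sin(2.0 * math.pi * n / P)
omega = lambda n: w
N = 500
recursive = phase_by_recursion(beta, omega, N)

# Closed-form integral of beta(t) * w from 0 to N, for comparison.
exact = w * (N + 0.2 * P / (2.0 * math.pi) * (1.0 - math.cos(2.0 * math.pi * N / P)))
drift = recursive[-1] - exact
```

In this example the recursion ends a few hundredths of a radian away from the exact integral after 500 samples. The discrepancy is small because $\beta$ changes slowly relative to the sampling interval, but it accumulates, which is the drift that an exact piecewise-functional evaluation would avoid.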

In demonstrating this system, a time-varying scale factor $\beta(t)$ was applied to two long
passages (25-30 s), one for a male and one for a female speaker. The excitation phase was
computed using the recursive approximation of (4.5). A piecewise-functional representation of
$\beta(t)$ has not yet been simulated. For the female speaker, the pitch was modulated by ±20%,
corresponding to an oscillation in $\beta(t)$ over the range 0.8 to 1.2. For the male speaker, the pitch
was modulated with a 20% lowering to a 50% increase, corresponding to an oscillation in $\beta(t)$
over the range 0.8 to 1.5. In both cases, the frequency range for analysis was adapted to the
pitch change such that the resulting synthesized speech fell over a 4 kHz range. As with a
uniform pitch change, this time-varying modification resulted in natural-sounding synthetic speech
which was free of artifacts.

Note  that  the  proposed  model  for  pitch  modification  has  neglected  changes  in  the  system 
spectral  characteristics  which  may  take  place  during  human  pitch  modification.  For  example,  to 
convert  a  female  voice  to  a  male-like  voice,  the  vocal  tract  spectral  envelope  may  need  to  be 
compressed by as much as 20% in frequency [9]. The transformation procedure given here preserves
the  system  spectral  envelope  while  altering  the  fundamental  frequency  of  the  excitation  function. 
Hence  it  may  be  desirable  to  develop  a  further  generalization  which  allows  modification  of  the 
system  spectral  amplitude  and  phase  as  well  as  the  excitation. 
5. JOINT TIME-FREQUENCY MODIFICATIONS


The  speech  transformation  systems  of  the  previous  sections  can  be  further  generalized  to 
perform  simultaneous  time-scale  modification,  frequency-scaling  and  pitch-scaling.  These  joint 
operations  can  be  carried  out  by  simultaneously  stretching  and  shifting  frequency  tracks.  They 
can also be performed with a continuously adjustable rate change p(t) and frequency scaling $\beta(t)$.
By combining the modified excitation phases (3.5d) and (4.1), the resulting generalized excitation
phase is given in continuous time by

$$\Omega'_\ell(t') = \int_{t'_\ell}^{t'} \beta(\tau)\,\omega_\ell[W^{-1}(\tau)]\,d\tau + \phi_\ell \tag{5.1}$$

where $W^{-1}(t)$ is the inverse time mapping of Section 3.2, which gives time values in the original
time scale required for time-varying rate change. When pitch modification is one of the desired
operations, it follows from the previous section that the system amplitude and phase will be
computed along the modified frequency tracks $\beta(t)\omega_\ell[W^{-1}(t)]$.

Evaluation of (5.1) in a discrete-time implementation requires a procedure similar to those of
Sections 3.2 and 4.2 when applying time-varying time-scale and pitch modifications. One frame-based
method which invokes an approximation, but which is easy to implement, uses the
time-inversion recursion together with a recursion similar to (4.5), resulting in the following
recursion for the excitation phase:

$$\Omega'_\ell(n') = \Omega'_\ell(n' - 1) + \beta[(t_{n'})_R]\,\omega_\ell[(t_{n'})_R] \tag{5.2}$$

where the inverse time $t_{n'}$ given by (3.16f) is computed modulo R and where the recursion is
initialized with the phase resulting from the (k - 1)st frame. Another approach uses functional
representations of $\beta(t)$ and p(t), resulting in an exact evaluation of the integral expression in (5.1).
This approach allows for a closer approximation to arbitrary time-varying functions.
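
The joint recursion (5.2) combines the inverse-time computation of (3.16f) with the phase accumulation of (4.5). A minimal Python sketch for one track is given below; it assumes that the per-frame frequency and pitch-scale functions are supplied as callables of the time into the original frame and that p(kR)R is an integer, and the name `joint_phase_track` is hypothetical.

```python
def joint_phase_track(freqs, rates, betas, R, phase0=0.0):
    """Excitation phase under joint time-scale and pitch modification, (5.2).

    freqs[k]: callable giving omega_l at time u into the kth original frame
    betas[k]: callable giving beta at time u into the kth original frame
    rates[k]: rate-change factor p(kR) on frame k
    R:        samples per original frame
    """
    phases = []
    prev = phase0
    for w, p, b in zip(freqs, rates, betas):
        inv_p = 1.0 / p                    # p^{-1}(kR)
        for m in range(int(round(p * R))): # m = n' - n'_k
            u = inv_p * m                  # (t_n')_R from (3.16f)
            prev = prev + b(u) * w(u)      # (5.2)
            phases.append(prev)
    return phases

# Illustrative use: constant 0.1 rad/sample track, beta = 0.5, p = 2.
const_w = lambda u: 0.1
const_b = lambda u: 0.5
phases = joint_phase_track([const_w, const_w], [2.0, 2.0], [const_b, const_b], R=100)
```

For these constant factors the track is stretched from 200 to 400 samples and advances by 0.05 rad per sample, so its total phase excursion is the same 20 rad as the unmodified track, now spread over twice the duration at half the frequency.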

In a first demonstration of the capability of the system to perform joint operations, frequency
compression and time-scale expansion were performed simultaneously, both by fixed factors, with
β = 0.5 and ρ = 0.5. The original values of the excitation amplitude and of the system amplitude
and phase are expanded in time and shifted to the new frequency tracks 0.5ω_ℓ(t/2). Figure 5-1(a,b)
shows the original and modified speech from a female speaker. The time scale has been expanded
and the pitch period increased. The inverse of these joint operations was also performed; i.e.,
simultaneous time-scale compression and frequency expansion, illustrated in Figure 5-1(c), was
performed on the modified waveform of Figure 5-1(b) with fixed factors of ρ = β = 2. These
operations effectively invert the original modifications, resulting in an estimate of the original
speech waveform. In the modified waveform, the pitch period is increased by a factor of two. Thus,
in order to maintain the original frequency resolution, an analysis window of twice the original
length (i.e., 40 ms rather than 20 ms) was used in implementing the inverse operations. Since the
time scale had been expanded, little time resolution was sacrificed with this increased analysis
window length.
Although the perceptual difference of the reconstruction from the original is nearly unnoticeable
in this case, some slight degradation in rapid transitions was occasionally heard, probably due to
the smearing by sequential operations. In particular, very low-pitch speakers, requiring long
analysis windows, appear to be most sensitive to sequential operations. The effect of the
concatenation of such transformations needs further investigation.

Figure 5-1. Joint frequency scaling and time-scale modification. (a) Original. (b) Frequency
compression and time-scale expansion. (c) Inversion of Figure 5-1(b).
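
The fixed-factor experiment can be illustrated on a single frequency track. The sketch below
is only an idealized check that scaling a track to 0.5ω(t/2) and then applying the inverse
factors returns the original track; it ignores analysis and synthesis entirely. The helper
transform_track, its stretch parameter (the factor by which the time axis is lengthened), and
the example track are hypothetical and chosen for illustration.

```python
import numpy as np

def transform_track(track, beta, stretch):
    """Return the track t -> beta * track(t / stretch).

    beta    : frequency scale factor
    stretch : factor by which the time axis is lengthened (ASSUMED convention)
    """
    return lambda t: beta * track(t / stretch)

def original_track(t):
    # Hypothetical slowly varying frequency track, in Hz
    return 200.0 + 20.0 * np.sin(2.0 * np.pi * 3.0 * t)

# Frequency compression by 0.5 with time-scale expansion by 2: 0.5 * w(t / 2)
modified = transform_track(original_track, beta=0.5, stretch=2.0)

# Inverse operations: frequency expansion by 2 with time-scale compression by 2
restored = transform_track(modified, beta=2.0, stretch=0.5)

t = np.linspace(0.0, 1.0, 101)
max_error = np.max(np.abs(restored(t) - original_track(t)))
```

On ideal tracks the round trip is exact. The degradation reported above arises in the actual
system, where each pass re-estimates the tracks from a windowed waveform.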

The flexibility of the system was also demonstrated with simultaneous time-varying time-scale
and pitch-scale modification. Here the excitation phase was computed according to (5.2), and the
system amplitude and phase were sampled along the modified frequency trajectories
β(t')ω_ℓ[W^{-1}(t')] (in discrete time, given by the phase correction term
β[(t_{n'})_R]ω_ℓ[(t_{n'})_R] of (5.2)). In one set of experiments, the operations of linearly
increasing and decreasing the pitch and the time scale were jointly applied to 3 s male and female
passages. The time scale changed by 50% and the pitch changed by 20% in both directions. In
experiments with a 30 s male passage and a 25 s female passage, these joint operations were
successfully demonstrated for an oscillatory pitch change with a ±20% modulation and an oscillatory
time-scale change with a ±50% modulation.
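
Control functions of the kind used in the oscillatory experiments might be generated as in the
following sketch. The report gives only the modulation depths (±20% for pitch, ±50% for time
scale); the sinusoidal shape, the modulation periods, the sampling step, and the function name
are assumptions made for illustration.

```python
import numpy as np

def oscillatory_factor(t, depth, period_s):
    """Scale factor oscillating about 1 with the given fractional depth.

    The sinusoidal shape and the period are ASSUMED; only the depth is
    taken from the experiments described in the text.
    """
    return 1.0 + depth * np.sin(2.0 * np.pi * t / period_s)

t = np.arange(0.0, 30.0, 0.01)                            # 30 s passage, 10 ms steps
beta_t = oscillatory_factor(t, depth=0.20, period_s=5.0)  # pitch scale, +/-20%
rho_t = oscillatory_factor(t, depth=0.50, period_s=8.0)   # time scale, +/-50%
```

The two factors are generated independently, so pitch and time scale can follow different
modulation periods within the same passage.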

Finally,  similar  joint  operations  were  used  to  enhance  a  passage  which  suffered  from  pitch 
disjunctions,  background  noise  and  a  lack  of  clarity  due  to  the  rapidity  of  the  spoken  speech. 

The problem of pitch disjunctions occurs in butting segments of speech together without pitch
contouring or smoothing. In this particular problem, three concatenated passages of the same
speaker were analyzed. An average pitch was computed within each passage. A constant pitch
scaling was then performed over each passage so that all three passages took on about the same
average pitch. The pitch disjunctions were reduced in the synthesized speech. Simultaneously with
this transformation, the waveform was slowed down by a factor of ρ = 1.5. The pitch changes
were more evident and the clarity of the passage improved.
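
The per-passage pitch equalization can be sketched as below. The report does not state which
common pitch was targeted, so the choice of the mean of the passage averages is an assumption,
as are the function name and the example pitch values.

```python
import numpy as np

def pitch_equalizing_factors(passage_pitch_tracks):
    """Constant pitch-scale factor per passage that equalizes average pitch.

    The common target (mean of the passage averages) is an ASSUMED choice.
    """
    averages = np.array([np.mean(track) for track in passage_pitch_tracks])
    target = averages.mean()
    factors = target / averages
    return factors, averages, target

# Hypothetical pitch tracks (Hz) for three concatenated passages
tracks = [
    np.array([100.0, 110.0, 105.0]),
    np.array([130.0, 125.0]),
    np.array([90.0, 95.0, 100.0, 95.0]),
]
factors, averages, target = pitch_equalizing_factors(tracks)
```

Each factor would then be applied as a constant β over its passage, so that the scaled
averages coincide at the target.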

6.  DISCUSSION


In this report, a sinusoidal representation of the speech production mechanism was used as
the basis for an analysis/synthesis technique which requires specification of amplitudes,
frequencies, and phases of vocal tract and excitation contributions of the component sine waves.
This system was successfully applied to a variety of speech transformations including time-scale
modification, frequency scaling and pitch scaling. Both fixed and continuously adjustable changes
were possible. These transformations do not require either an explicit pitch estimate or
voiced/unvoiced decisions. Although this waveform representation was originally designed for
single-speaker signals, it is equally capable of reconstructing and modifying nonspeech signals
such as music, multiple speakers, marine biologic sounds and speech recorded in the presence of
noise and musical backgrounds. Background interferences are modified along with the speech, and
were found to be synthesized naturally and without artifacts.

The main computational load of this system is in the evaluation of a high-resolution
spectrum using a 512-point FFT, homomorphic filtering for vocal tract parameter estimation, and
the generation of the sine-wave components. Other operations such as frequency estimation by
peak-picking, frequency matching, and phase interpolation and unwrapping add an insignificant
amount to the overall computational load. A study of the computational complexity and
feasibility of this system is being made by way of a 16-bit real-time (fixed-point) implementation
on Lincoln Laboratory’s Digital Signal Processors (LDSP)²⁴. The current status of the
implementation indicates that the method, with an analysis frame interval of 20 ms, can be realized
in real time using a few commercially available signal processing chips. A number of proposed
multiple-processor architecture designs appear to be feasible both in terms of cost and size. More
detailed design studies of processor architectures for the sinusoidal analysis/synthesis system are
in progress.
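
The spectral analysis step named above (a 512-point FFT followed by peak-picking) can be
sketched as follows. This is an illustrative outline only: the Hamming window, the 10 kHz
sampling rate, the simple local-maximum rule, and the function name pick_peaks are assumptions,
and the homomorphic filtering and frequency-matching stages are not shown.

```python
import numpy as np

def pick_peaks(frame, fs, n_fft=512):
    """Return frequencies (Hz), magnitudes and phases at local spectral maxima."""
    window = np.hamming(len(frame))                  # ASSUMED window type
    spectrum = np.fft.rfft(frame * window, n=n_fft)  # zero-padded 512-point FFT
    mag = np.abs(spectrum)
    interior = mag[1:-1]
    # A bin is a peak if it exceeds its left neighbor and is not below its right one
    is_peak = (interior > mag[:-2]) & (interior >= mag[2:])
    bins = np.nonzero(is_peak)[0] + 1
    freqs = bins * fs / n_fft
    # Phases are measured relative to the start of the frame
    return freqs, mag[bins], np.angle(spectrum[bins])

fs = 10_000.0                                        # ASSUMED sampling rate (Hz)
n = np.arange(200)                                   # one 20 ms frame at 10 kHz
frame = np.cos(2.0 * np.pi * 1000.0 * n / fs)        # hypothetical 1 kHz test tone
freqs, amps, phases = pick_peaks(frame, fs)
strongest = freqs[np.argmax(amps)]
```

For the test tone, the strongest peak falls in the FFT bin nearest 1 kHz; the remaining local
maxima are window sidelobes, which a practical system would need to handle separately.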

It should be noted that an earlier “magnitude-only” version²⁵ of the sine-wave-based system
provided an important stepping stone to the speech transformation system in this report. The
baseline analysis/synthesis for this system did not rely on a measurement of phase nor did it use
the speech production model; rather the sine-wave phase function was obtained by integrating a
frequency trajectory formed by linearly interpolating matched frequencies over consecutive
frames. While the transformed speech was very intelligible and free of artifacts, it was perceived
as being different in quality from the original; the differences were more pronounced for
low-pitched (i.e., pitch less than about 100 Hz) speakers. When the magnitude-only system was used
to synthesize noisy speech, the noisy speech took on a tonal quality that was unnatural and
annoying. The use of the measured sine-wave phase and the introduction of the speech
production model resulted in a much improved quality in the modified speech.
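
The phase construction of the magnitude-only system, as described above, amounts to summing a
linearly interpolated frequency track. The following sketch shows this for one sine wave over
one frame; the function name, the rad/sample units, and the example values are assumptions.

```python
import numpy as np

def magnitude_only_phase(w_start, w_end, frame_len, phase0=0.0):
    """Phase from integrating a linearly interpolated frequency track.

    w_start, w_end : matched frequencies (rad/sample) at consecutive frames
    frame_len      : number of samples R between the two frame boundaries
    phase0         : phase at the end of the previous frame
    """
    n = np.arange(1, frame_len + 1)
    w = w_start + (w_end - w_start) * n / frame_len   # linear interpolation
    return phase0 + np.cumsum(w)                      # discrete-time integration

# Hypothetical track rising from 0.1 to 0.2 rad/sample over a 100-sample frame
phase = magnitude_only_phase(0.1, 0.2, frame_len=100)
steady = magnitude_only_phase(0.1, 0.1, frame_len=100)
```

Because no measured phase enters this construction, the synthetic phase is continuous across
frames but generally differs from the phase of the original waveform.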

In spite of the initial success of the system in this report, there remain a number of areas of
possible improvement and some interesting questions for further exploration. For example,
although the modified synthetic waveforms tend to be speech-like in appearance, some structure
of the original waveform is lost, due possibly to the minimum phase vocal tract assumption
implicit in the homomorphic analyzer and to dispersion of the excitation function. Thus alternate
methods of estimating the system phase and procedures for making the excitation function less
dispersive might be sought. One of the more interesting questions yet unanswered involves the
good performance of the system in the face of nonspeech signals and speech in interference.
Since the model underlying the modification system is based on the speech production
mechanism, the robustness of the system to such a signal class is not understood. Finally, this
report  has  only  touched  on  the  invertibility  of  the  speech  transformations,  a  property  which  may 
have  considerable  practical  importance.  For  example,  bandwidth  reduction  prior  to  waveform 
coding  could  be  achieved  using  the  frequency  compression  transformation.  However,  it  is 
required that the coded speech be expanded back to its original bandwidth. Since this inversion
process requires that speech analysis take place over a 20-30 ms duration on the
frequency-compressed waveform, it may be difficult to achieve the frequency resolution required for
adequate reconstruction.
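
The resolution difficulty can be made concrete with a rough scaling argument, consistent with
the window doubling used in Section 5. The symbols f_0 (fundamental frequency), \Delta f
(harmonic spacing) and T (analysis window length) are introduced here for illustration only,
and the proportionality is a rule of thumb rather than a result derived in this report.

```latex
\Delta f = f_0
\;\longrightarrow\;
\Delta f' = \beta f_0 ,
\qquad
T' \approx \frac{T}{\beta} ,
\qquad
\beta = 0.5:\quad T = 20\ \text{ms} \;\Rightarrow\; T' \approx 40\ \text{ms}.
```

Frequency compression moves the harmonics closer together, so resolving them calls for a
proportionally longer window, at the cost of time resolution.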